A Three-pronged Approach to Bringing Machine Learning Models Into Production
Eugene Rudenko, AI Solutions Consultant, co-authored this machine learning development article with Vitaliy, a Data Scientist at NIX United. The piece presents a three-pronged method for putting ML models into production, along with a commercial perspective.
Throughout this article, I will explain several ways to deploy machine learning models in production, based on our team's experience. The main criteria for choosing these approaches were convenience, speed of operation, and completeness of functionality. I will also describe the bottlenecks we encountered and the solutions we eventually applied.
Engineers in data science and MLOps will find this article valuable. With this material, you will be able to set up simple, fast, continuous delivery within ML.
In data science, shipping ML models to production often remains in the background, as it is the last stage. Before it come data collection, selection of algorithms to solve the problem, testing of various hypotheses, and experiments. The first time we see results and the problem seems somewhat solved, we understandably want to cheer “Hurray! Triumph! Victory!”
However, we must still find a way to make the model work, and not within some Jupyter Notebook but in a real application with real workloads and real users. Furthermore, the production phase implies two other requirements. The first is the option to replace the machine learning model with a new one without stopping the application (hot swap). The second is to configure access rights to the model and run several versions of it simultaneously.
In our team’s projects, we have tried many approaches for models created and trained in various ML frameworks. I will focus on the variants that we most often use in practice.
Before moving on to serving tools, we tried developing our own web applications and loading trained models into them. However, we ran into several issues with this approach. We had to cope with the web applications' internal multithreading implementations, which clashed with those of the ML frameworks, and with the initial loading of the models: because loading is time-consuming, the apps were not ready to serve requests right away. There were also problems with many users working at the same time, with access control, and with restarting the application after training a new version of the model. Thanks to specialized libraries, these issues are now a thing of the past.
Tensorflow Serving
This is most likely the best way to interact with Tensorflow models. We've also used this framework to work with PyTorch models that were converted to Tensorflow via the ONNX intermediate format.
The main advantages of Tensorflow Serving:
– Support for multiple versions. It’s simple to set up operations so that, for example, many versions of the same model can run at the same time. You can do A/B testing or keep the QA/Dev versions running this way.
– The ability to swap out models without shutting down the service. For us, this is a very useful feature. We can place a new version in the model folder without halting the service; Tensorflow waits until the model has finished copying, then loads the new version, deploys it, and retires the previous one. All of this goes unnoticed even when users are actively interacting with the model.
– Auto-generated REST and gRPC APIs for working with models. This is perhaps the library's most beneficial feature. There is no need to write any services: access to all models is granted automatically. There is also a mechanism for retrieving model metadata, which we use frequently when we deploy third-party models and need to know the input data types. We use gRPC when we need more speed, as this protocol is significantly faster than REST.
– Working with Kubernetes and Docker. At present, a Docker container is our primary way of working with Serving. Serving is loaded in a separate container, into which we copy the configuration file containing our model descriptions. After that, we add a Docker volume containing the models themselves. The same Docker volume is used in additional containers where we train new models as needed (we had the option to use it in Jupyter and in a separate application). This scheme has been thoroughly tested and is now used in a number of our projects.
– Scalability. We are still studying this feature and plan to use it in the future. In theory, Tensorflow Serving can run in Kubernetes (e.g. on Google Cloud) with serving instances behind a LoadBalancer, so the load on the models is shared across multiple instances.
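As a small illustration of the auto-generated REST API mentioned above, the sketch below builds a predict request for a Tensorflow Serving instance, including pinning a specific model version (useful for the A/B testing scenario). The model name, port, and input shape are hypothetical; substitute your own. The actual HTTP call is shown in a comment, since it requires a running server.

```python
import json

# Hypothetical model name and endpoint -- adjust to your deployment.
MODEL_NAME = "my_model"
HOST = "http://localhost:8501"  # TF Serving's default REST port


def predict_url(model, version=None):
    """URL for TF Serving's REST predict endpoint; pin a version to A/B test."""
    v = f"/versions/{version}" if version is not None else ""
    return f"{HOST}/v1/models/{model}{v}:predict"


def build_body(instances):
    """JSON body in the row ('instances') format TF Serving expects."""
    return json.dumps({"instances": instances})


url = predict_url(MODEL_NAME, version=2)
body = build_body([[1.0, 2.0, 3.0]])

# With a running server you would send it, e.g.:
#   import urllib.request
#   req = urllib.request.Request(
#       url, body.encode(), {"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())
print(url)
print(body)
```

Dropping the `version` argument targets the latest deployed version, and replacing `:predict` with `/metadata` (on the versionless model URL) retrieves the model metadata described above.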
Disadvantages of Tensorflow Serving:
– It’s tough to deploy a model that wasn’t built with Tensorflow (sklearn, LightGBM, XGBoost). Although extensions for them are supported, you have to write your own C++ code.
– You have to take care of security yourself: for example, closing network access to Tensorflow Serving and leaving it open only for your own services, which then implement authentication. In our Docker deployment, we normally close all ports on the container running the service, so the models are accessible only from other containers on the same subnet. Docker Compose serves this container bundle well.
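To make the isolation idea concrete, here is a minimal Docker Compose sketch. All service, image, and volume names are hypothetical: the serving container publishes no ports, so it is reachable only from the other services on the same Compose network, while the authenticating gateway is the only thing exposed.

```yaml
version: "3"
services:
  tf-serving:
    image: tensorflow/serving
    environment:
      - MODEL_NAME=my_model        # hypothetical model name
    volumes:
      - models:/models/my_model
    # note: no "ports:" section -- unreachable from outside the network
  api-gateway:
    build: ./gateway               # your service; implements authentication
    ports:
      - "443:443"                  # the only externally exposed endpoint
volumes:
  models:
```

The gateway then talks to the serving container by its service name (e.g. `http://tf-serving:8501`) over the internal network.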
Comparing Tensorflow to PyTorch, until recently the latter had no way to serve a model with anything similar to Tensorflow Serving; even the official documentation demonstrated serving via a Flask service. With Tensorflow, you don't need to do this, since Serving builds such a service automatically. As for the drawbacks, they mattered to us when we were first learning Serving, but they are no longer relevant in the architecture we employ.
“In order to add some business context to the technical research that has been outlined, I will briefly share a few relevant cases.
We partnered with an innovative startup in the in vitro fertilization field to help them implement a number of prediction models, achieve the required optimization, and build a continuous deployment process. Since TensorFlow was used initially, we decided to build on the existing environment and incorporated the Tensorflow Serving method. The released product reliably predicts embryo quality without human intervention, though it keeps us busy with ongoing model enhancements.
Another example, where we went with Predictive Model Markup Language, was an AI-based advertising system aimed at maximizing the outcome of ad campaigns through deep learning of customer profiles and personalized offerings. Since the implementation required massive data processing, we had to bring in our data engineers to build Scala-based data pipelines. Therefore, the ability of PMML to produce Scala-ready ML models was the decisive benefit that led us to select it over other options.” — Eugene Rudenko, AI Solutions Consultant
Triton Inference Server
Triton Inference Server (TIS) is another popular model deployment framework, this one from Nvidia. It allows you to deploy GPU- and CPU-optimized models both locally and in the cloud, supports the REST and gRPC protocols, and can be embedded directly into applications on end devices as a C library.
The main benefits of the Triton Inference Server:
– Ability to deploy models trained with various deep learning frameworks. These are TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, and OpenVINO. Both TensorFlow 1.x and TensorFlow 2.x versions are supported. Triton also supports model formats such as TensorFlow-TensorRT and ONNX-TensorRT.
– Parallel operation and hot-swapping of deployed models.
– It isn’t just for deep learning models. Triton provides an API that allows you to use any Python or C++ algorithm. At the same time, all of the benefits of the deep learning models used in Triton are preserved.
– Model Pipelines. When several models are deployed and some are awaiting data from other models, the models can be integrated into sequences. Sending a request to a group of models like this will cause them to run in order, with data traveling from one model to the next.
– Ability to integrate Triton as a component (C-library) in the application.
– Deployment. The project provides a number of Docker images that are updated and expanded regularly. Combined with Kubernetes, this is a good approach to establishing a scalable production environment.
– A number of metrics allow you to monitor the status of models and the server.
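For reference, each model in a Triton model repository is described by a `config.pbtxt` file. The fragment below is an illustrative sketch only; the model name, backend, and tensor names/shapes are hypothetical and must match your exported model.

```protobuf
# Hypothetical ONNX object-detection model served on GPU.
name: "detector"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [ -1, 4 ]
  }
]
instance_group [ { kind: KIND_GPU } ]
```

Placing this file next to the versioned model directories (`detector/1/model.onnx`, `detector/2/model.onnx`, ...) is what enables the parallel operation and hot-swapping behavior listed above.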
We can say from experience that the combination of Triton Server + TensorRT Engine works well, since this format lets models be as performant as possible. However, at least two points must be considered. First, the TensorRT Engine should be compiled on a device with the same GPU/CPU as the Triton deployment environment. Second, if you have a custom model, you may have to implement missing operations manually.
In terms of the latter, this is quite common when employing non-standard SOTA models. You can find a variety of TensorRT implementations on the web for popular models. For example, in the project where we needed to train an object-detection algorithm in PyTorch and deploy it on Triton, we followed several examples of the PyTorch -> TensorRT -> Triton path. The implementation of the model on TensorRT was taken from here. You may also be interested in this repository, as it contains many current implementations supported by developers.
PMML (Predictive Model Markup Language)
To be clear, PMML is not a serving library but a format for saving models, in which you can save scikit-learn, Tensorflow, PyTorch, XGBoost, LightGBM, and many other ML models. In our practice, we used this format to export a trained LightGBM model and convert the result into a jar file using the jpmml transpiler. As a result, we received a fully functional model that could be loaded into Java/Scala code and used immediately.
In our case, the main goal of applying this approach was to get a very fast response from the model, and indeed, compared to the same model in Python, the response time decreased by about 20 times. The second advantage is the reduced size of the model itself: in transpiled form it became three times smaller. There are disadvantages, however: since this is a fully programmatic way of working with models, you must implement retraining and model substitution yourself.
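To give a sense of what the format looks like, below is a heavily abridged, illustrative PMML skeleton for a toy regression tree. This is not a file from our project; real PMML documents are generated by converters (e.g. sklearn2pmml or the LightGBM-to-PMML toolchain), and it is such a generated file that the jpmml transpiler turns into a jar.

```xml
<!-- Illustrative skeleton only; field names and values are hypothetical. -->
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="toy regression tree"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TreeModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <Node score="0.0">
      <True/>
      <Node score="1.5">
        <SimplePredicate field="x1" operator="greaterThan" value="0.5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Because the model is an ordinary XML document rather than a pickled Python object, it can be consumed by JVM tooling directly, which is what made the Scala integration described above possible.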
In conclusion, I would note that the three options listed above are not silver bullets, and other approaches exist that we have not covered. We are now taking a closer look at TorchServe and trying Azure ML-based solutions in some projects. The methods named here are what have worked well on most of our projects: they are fairly easy to set up and implement, don't take much time, and can be handled by an ML engineer to get the solution ready. Of course, requirements vary from project to project, and each time you have to decide which ML deployment method is appropriate in a particular case. In really difficult instances, you'll certainly need to work with MLOps engineers and, in some cases, design an entire pipeline combining multiple methods and services.