I decided to share the experience I gained at a million-user platform company. It was an exciting journey to build a platform that serves over a hundred engineers. Knowing these details would have saved us a lot of time and pain while implementing our MLOps platform and spared our service reliability team a few headaches.
Companies take varied approaches to implementing MLOps. Most of the time, architectural decisions revolve around the needs of the product and of the engineers using the platform. Moreover, growing an MLOps capability is always a long-term journey: from implementing basic pipelines and monitoring for concept and data drift to meeting data scientist needs that are still waiting to be discovered. Improving MLOps can also mean replacing the entire existing platform or changing some of its major components.
I had the privilege of working at a big company with around a hundred machine learning engineers while it migrated its platform from a mostly on-prem, custom-built solution to the cloud. The most complicated decisions we made were choosing the right platform for our needs and moving both the data and MLOps platforms to the cloud. MLOps is tightly coupled to the data warehouse, so that move affected the data team, the MLOps team, and the service reliability team.
Therefore, the most important questions you should raise before implementing MLOps are:
Should we build a platform or use an existing one?
Can we live with a cloud lock-in?
How and where should we keep our data?
Where do we serve our models? Do we need a custom serving component?
What are our user needs? Do we need low-code or no-code machine learning?
How do we train our models? Do we use on-prem resources or cloud resources?
Which ML frameworks should our platform support?
How do we manage packages?
Do we need a feature store? If so, how do we manage our feature lifecycle?
How do we ensure our model's scalability and reliability?
I’ll cover all of them briefly in this post.
1. Using an Existing Platform vs. Building One
The MLOps landscape today is vibrant, with many companies offering great partial or end-to-end solutions. When choosing between using an existing platform and building your own, you should consider:
- Your training needs. Can the platform of your choice work with GPUs? Do you need TPUs?
Unfortunately, not every platform on the market supports training with GPUs. A platform might have great features and a very nice UI, but it is of little use if you need GPU training and it does not support it. Make sure to find that out when you talk to the vendor. And if you really need the power of TPUs, your only cloud choice today is Google Cloud, for example through Vertex AI.
- Your serving needs. Do you need to serve on-prem? On cloud?
When it comes to serving, you have multiple choices. You can use the platform's own serving offering. However, it is usually very opinionated and limiting. We found that having our own component running on FastAPI let us track all the metrics we wanted in the format we wanted. Moreover, it was a highly independent component to develop that could be deployed on a K8s cluster anywhere: cloud or on-prem. One thing to consider: if you want to track data and concept drift, you will need information about the test dataset (ground-truth data) combined with inference input data and results. This can make your component more complex and more dependent on other systems. However, there are many widely used open-source tools that can be an alternative to a component you build yourself. You can try Seldon Core, which has data drift tracking included. So if you do not have enough time and development resources, you can integrate it into your MLOps workflow.
- Your resources. Do you have enough developers and experience?
While building your own platform components can bring a lot of flexibility, you should consider their future. You will become their maintainer and bug fixer. An alternative is to go with open-source tools and enjoy the help of the community. And if you are really short on talent, there are always enterprise solutions with tech support services.
2. Cloud Lock-In
While it is very important to pick the cloud platform that suits your training needs, avoiding cloud lock-in entirely might not be a smart idea. Today Google, Microsoft, and AWS offer a multitude of cloud services that can make your MLOps fast and user-friendly. When picking one, evaluate your company's needs clearly. All the major cloud platforms offer APIs, so you can create your own integrations for data access, the feature store, and other custom components. You can also use services separately: training, serving, and the feature store are all independent services.
However, there are some implementation differences. For example, AWS tends to be stable in its feature offering, supporting all the latest training libraries and taking a modular approach to services. Its data preparation tool, Data Wrangler, is also very well developed. The downside of this stable, modular approach is a subpar UI. GCP took an integrated approach, connecting all the services in an intuitive UI, which our users liked. However, while we were operating our platform, the service got an upgrade and we had to plan a migration to the new version. It brought more work for us, but our users still preferred its minimal UI. We are still a very dynamic company with an entire team dedicated to MLOps, so it wasn’t a terrible hurdle. If you are short on development resources, stability might be something you should consider investing in.
3. Data For Training
You should aim to keep your training data in the same location as your platform. That eliminates data transfer times. Keeping it in separate locations (on-prem and cloud, or different cloud zones) will require you to transfer the data before training. If you have no choice, you can use a VPN connection from your premises to the cloud, but it is very slow compared to storing your data in a cloud service. Moreover, cloud services these days are very fast, and a data store such as BigQuery can make loading your data 10–100 times faster, greatly reducing model development time.
4. Model Serving
You should aim to serve models in the same location as your app. If you serve your app on-prem, then you should deploy your online models on the same physical servers to reduce inference time. Serving models on another platform or in a different region makes your requests travel over the wire, adding tens to hundreds of milliseconds of delay, which will be noticeable in the responsiveness of your app.
A custom serving component can bring more flexibility in the way you track your data changes, model concept drift, and metrics. However, open-source serving tools leave a lot of room for configuration too, and have lower development costs.
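As a rough illustration of the data drift tracking mentioned above, here is a minimal sketch that compares one numeric feature in the inference inputs against the training data using the Population Stability Index (PSI). The function name, the bin count, the sample data, and the 0.2 alert threshold are assumptions for illustration (0.2 is a common rule of thumb, not a standard), and this is not how any particular serving tool implements drift detection.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature; higher means more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def bucket_shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bucket_shares(reference)
    cur_shares = bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Hypothetical data: user ages seen in training vs. ages seen in production.
training_ages = [20 + (i % 30) for i in range(300)]
live_ages = [35 + (i % 30) for i in range(300)]

psi = population_stability_index(training_ages, live_ages)
if psi > 0.2:  # assumed alert threshold
    print(f"Possible data drift, PSI={psi:.2f}")
```

A real component would likely run a check like this per feature on a schedule and export the result as a metric, so that reliability engineers and data scientists see the same number.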
5. User Needs
Make sure you explore your user needs well. If you have engineers who prefer low-code or no-code machine learning, AutoML offerings will help them move faster. This is quite time-consuming to implement yourself but is already well solved by the major cloud providers. Other significant players in the area, such as Dataiku, also have automated training available.
However, if using a platform is not an option, consider supporting your users with reusable code snippets. A snippet library is simple to set up and can go a very long way in making your engineers happy: no need to reinvent data queries, training job descriptions, visualisations, and metric tracking. It greatly aids knowledge sharing as well!
6. Model Training
When deciding whether to train in the cloud or on-prem, consider the human resources you have available. While it might seem more cost-efficient to buy hardware with GPUs and run it locally, that is not the full picture. You will need to take care of the hardware and ensure service reliability and GPU availability, which means having a way to handle training jobs and a support team to ensure the machines are always healthy and ready for training.
There is software you can install on your hardware to handle training jobs. However, it is almost impossible to reach the availability and reliability a major cloud provider can offer: cloud services provide a practically unlimited supply of machines with a vast selection of GPUs that are available 24/7 with tech support. This makes it much less likely that you will run out of GPUs as the number of ML training jobs within your company grows.
7. ML Frameworks
Make sure the platform of your choice supports all the training frameworks (TensorFlow, Keras, XGBoost, PyTorch, R, etc.) you need. While major cloud providers do this pretty well, some products on the market might limit you. This is especially true if you need R or the latest version of TensorFlow: commercially available platforms have upgrade cycles, so your newest TensorFlow model might not be supported for another year or two. Before you buy a platform, clarify how often the frameworks you use will be upgraded and to which versions.
8. Package Management
Package management can be handled by a few tools that are available today, so make the choice that is optimal for your working environment. Aim for a unified approach across all the workflows in your company. Things to consider: the architecture of your platform and the workflow for model development.
The most common choices today are pip, conda, and poetry. I believe poetry has a great advantage when it comes to package versioning. Its pyproject.toml file contains the information about the project and lists the dependencies used for development and for production separately. With pip and conda this separation has to be implemented manually, which makes keeping two different environments for each project a hurdle. Moreover, trying another version of Python with poetry is a breeze: you can simply rebuild the environment for the project after changing the Python version in the pyproject.toml file.
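For illustration, a pyproject.toml that separates production and development dependencies might look like the fragment below. The project name, author, and package versions are placeholders, and the `[tool.poetry.group.dev.dependencies]` table assumes Poetry 1.2 or newer (older versions used `[tool.poetry.dev-dependencies]`).

```toml
[tool.poetry]
name = "churn-model"                  # hypothetical project name
version = "0.1.0"
description = "Example ML project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]            # installed everywhere, including production
python = "^3.10"                      # change this to try another Python version
scikit-learn = "^1.3"

[tool.poetry.group.dev.dependencies]  # development only
pytest = "^7.4"
jupyterlab = "^4.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

With a layout like this, `poetry install --without dev` should install only the production set, while a plain `poetry install` also pulls in the development group.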
9. Feature Store
If you are just starting to implement your MLOps workflow, a feature store should not be your first worry. You can build and train using data directly from your warehouse. However, once you have your pipeline up and running, a feature store does help to speed up model development by making computed features reusable.
Very often, features computed from initial user/item/service data are significant in many ML use cases. Consider this: you recognized items in pictures with a computer vision model. One data scientist wants to use that information for the recommendation engine. Another data scientist wants to create a fraud-detection model that detects illegal items. Very often they don’t know about each other's work because they are working in different domains within the company. Really, the possibilities to reuse this data are endless, and you save many hours of data science work when you keep this data in the feature store. Something to keep in mind though: have a clear process for defining new features and avoiding feature duplication.
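One lightweight way to support such a process is to register every feature definition in one place and reject duplicates. The sketch below is a hypothetical in-memory registry, not the API of any particular feature store product; the class, field, team, and feature names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    name: str         # unique feature name
    source: str       # table or model the feature is computed from
    owner: str        # team to contact before changing the feature
    description: str


class FeatureRegistry:
    """Keeps exactly one definition per feature name."""

    def __init__(self) -> None:
        self._by_name: dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        if feature.name in self._by_name:
            existing = self._by_name[feature.name]
            raise ValueError(
                f"Feature '{feature.name}' already exists (owner: {existing.owner})"
            )
        self._by_name[feature.name] = feature

    def find_by_source(self, source: str) -> list[FeatureDefinition]:
        """Lets a data scientist check what is already computed from a source."""
        return [f for f in self._by_name.values() if f.source == source]


registry = FeatureRegistry()
registry.register(
    FeatureDefinition(
        name="item_detected_objects",
        source="image_recognition_model",
        owner="computer-vision",
        description="Objects recognized in item pictures",
    )
)

# The fraud team looks for existing features before computing its own.
for feature in registry.find_by_source("image_recognition_model"):
    print(feature.name, "owned by", feature.owner)
```

In practice the registry would live in a shared service or repository, but even this much gives the recommendation and fraud teams one place to discover each other's features.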
10. Model Scalability and Reliability
MLOps covers the entire lifecycle, from data preparation to models in production. Whether you serve in the cloud or on-prem, Kubernetes is usually a great way to serve your models, thanks to its load-balancing and automated scaling capabilities.
However, make sure you implement a proper CI/CD process. Test your models before deploying them into production. Is the new version of the model more accurate than the last one? Did the test inference return a proper data response? Is the new version of the model fast enough (some complex deep learning and recommendation models might be too slow for production use!)? If the model does not pass one of those requirements or integration tests fail, it is not suitable for production. Also, make sure to involve your service reliability engineers in the process.
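The checks above can be sketched as a small gate that a CI/CD job runs before deployment. This is a minimal sketch under stated assumptions: the report fields, the required response keys, and the 100 ms latency budget are hypothetical and should be replaced with whatever your own pipeline measures.

```python
from dataclasses import dataclass


@dataclass
class CandidateReport:
    accuracy: float            # accuracy of the new model on the held-out test set
    baseline_accuracy: float   # accuracy of the model currently in production
    p95_latency_ms: float      # 95th percentile latency measured in a test run
    sample_response: dict      # output of one test inference


def promotion_failures(
    report: CandidateReport,
    required_keys: frozenset = frozenset({"prediction", "model_version"}),
    max_p95_latency_ms: float = 100.0,
) -> list[str]:
    """Return the reasons the candidate must not be deployed (empty means pass)."""
    failures = []
    if report.accuracy <= report.baseline_accuracy:
        failures.append("not more accurate than the production model")
    missing = required_keys - set(report.sample_response)
    if missing:
        failures.append(f"response is missing keys: {sorted(missing)}")
    if report.p95_latency_ms > max_p95_latency_ms:
        failures.append("too slow for production")
    return failures


candidate = CandidateReport(
    accuracy=0.91,
    baseline_accuracy=0.89,
    p95_latency_ms=240.0,
    sample_response={"prediction": 1, "model_version": "candidate"},
)
for reason in promotion_failures(candidate):
    print("Blocked:", reason)
```

Returning a list of reasons instead of a single boolean makes the failed pipeline run self-explanatory for both the data scientist and the reliability engineer.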
Some models that are continuously retrained can be deployed automatically. However, be careful when facing the unknown. Consider always having a human in the loop to approve the deployment of new models into production. Very often data scientists can only roughly guess the resources that need to be allocated for the model. This can cause a newly deployed model, which passed all the tests, to crash. Your on-shift reliability engineer can fix this quickly. Also keep in mind that some models use more resources while loading than while in operation, so being more generous with allocated resources is always a safe choice. This is the reason why simply using autoscaling in K8s might not work for you. Know your models and keep communication open between reliability engineers and data scientists.
Last but not least: make sure you version all your models and their pipelines, and keep copies of them in backup storage!
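One simple way to approach this is to derive the version from the model file's content and store a small manifest next to the backup copy. The sketch below assumes a local backup directory and a pipeline commit hash supplied by the caller; the function name, directory layout, and manifest fields are hypothetical, and real backup storage would more likely be a cloud bucket.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def archive_model(model_path: Path, pipeline_commit: str, backup_dir: Path) -> Path:
    """Copy a model artifact to backup storage under a content-derived version."""
    # Reading the whole file is fine for a sketch; stream large models in chunks.
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    version = digest[:12]  # short, content-based version identifier
    target_dir = backup_dir / model_path.stem / version
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_path, target_dir / model_path.name)
    manifest = {
        "model_file": model_path.name,
        "sha256": digest,
        "pipeline_commit": pipeline_commit,  # ties the model to its pipeline code
    }
    (target_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return target_dir


# Demonstration with a throwaway file standing in for real model weights.
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "recommender.bin"
    model_file.write_bytes(b"fake model weights")
    archived = archive_model(model_file, "abc1234", Path(tmp) / "backup")
    print(sorted(p.name for p in archived.iterdir()))
```

Because the version is derived from the file content, archiving the same model twice lands in the same directory instead of creating a silent duplicate.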
To summarise, your MLOps success depends on answering your user needs, automating as much as possible, speeding up the development process, and keeping your ML models scalable and reliable. The MLOps chain doesn’t end with deployment; models need to be continuously monitored and improved. It is very important to keep conversations going among the teams responsible for data, model development, operation, and reliability of the application. Successful MLOps platforms serve their user needs and are well integrated with the data and serving infrastructure inside the company.