Note: this article was co-published with Fritz AI.
MLOps is a term you have probably started hearing increasingly often in the machine learning space — and for good reason.
MLOps is a set of techniques for putting models built by data science and machine learning engineering teams into production. It’s similar to DevOps in many ways, also centering on automation, collaboration between teams, and continuous processes for testing and delivery.
Industry awareness of MLOps is growing, as more and more businesses realize that the hacky, works-on-my-machine workflows that data scientists can use to quickly leverage newly built models internally do not scale to the large datasets, user demands, and business oversight that come with production deployments into enterprise products.
Unified experiment management
ML typically involves people from several different roles working together — data scientists, software engineers, analysts, and project managers. This means there are often many moving parts and interdependencies, and as a result, cross-team communication becomes especially important.
The first need most ML teams encounter is a centralized model training and evaluation platform. Moving ML experiments off individual contributors’ machines and into a shared communal space gives the entire team access, making it easy to bring up one another’s work whenever needed.
This is critically important for non-technical stakeholders and business leaders, who need to make on-the-fly decisions based on a project’s progress and potential. A shared space for viewing and reviewing experiments removes one or more steps from the communications processes that the front office relies on, allowing for better, faster business decisions.
Of course, there are benefits on the engineering side, too. Shared experiment tracking encourages best practices like code and model review, which allow teams to share expertise and build better models faster. Unified experiment management doesn’t require abandoning local workflows either — features like Spell’s Private Machine Types wrap experiments run on local machines and ones run on cloud machines with the same process, allowing model-builders to share experiments with the rest of the team across environments.
Automated training and comparison
As the number of experiments grows, so too does the importance of automation. Automation pipelines for training, optimization, testing, and delivery help prevent breakages and speed up iteration and time to production. Once the volume of experiments becomes large, manually managing runs and resources becomes unwieldy.
MLOps automation tools like DVC and Spell’s workflows feature build on version control systems like Git and established patterns from the DevOps world, like GitOps, to bring software automation to the world of machine learning model management. Core to these tools are the concepts of pipelines and lineage.
Pipelines are sets of processes, usually represented as directed acyclic graphs of tasks (popularized by tools like Apache Airflow), that represent the full flow of training a model, scoring it on a test dataset, and writing the results to a comparison tool for evaluation. In other words, pipelines are the how of model training automation.
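To make the idea concrete, here is a minimal sketch of such a pipeline in plain Python, using the standard library’s TopologicalSorter to execute a train → score → report chain in dependency order. The task names, metric value, and DAG structure are all illustrative, not tied to any particular tool:

```python
# Illustrative training pipeline: a directed acyclic graph of tasks
# executed in dependency order. Tasks and values are stand-ins.
from graphlib import TopologicalSorter

def train(ctx):
    ctx["model"] = "model-v1"      # stand-in for fitting a model

def score(ctx):
    ctx["test_accuracy"] = 0.93    # stand-in for evaluation on a test set

def report(ctx):
    # Write results out for comparison/evaluation.
    ctx["report"] = f'{ctx["model"]}: accuracy={ctx["test_accuracy"]}'

TASKS = {"train": train, "score": score, "report": report}
# Each task maps to the set of tasks it depends on.
DAG = {"train": set(), "score": {"train"}, "report": {"score"}}

def run_pipeline(dag, tasks):
    ctx = {}
    for name in TopologicalSorter(dag).static_order():
        tasks[name](ctx)
    return ctx

result = run_pipeline(DAG, TASKS)
print(result["report"])  # model-v1: accuracy=0.93
```

Production tools like Airflow, DVC, and Spell workflows add scheduling, caching, and distributed execution on top of this same core idea: tasks plus a dependency graph.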
Lineage, aka model lineage, is the inheritance chain of a machine learning model. It encodes the decisions made and lessons learned from previous experiments in the chain of model-building decisions, along with the difference between what was tried in the current experiment and what was tried in the past.
Lineage is essential to answering one of the core questions in machine learning development: the "have you tried X?" question. In large teams, it’s impossible for everyone to know everything that’s already been tried and what should be done next without tooling that supports this need. In other words, lineage is the so what of model training automation.
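A minimal sketch of what lineage buys you, assuming a hypothetical in-memory experiment store keyed by experiment ID: each record points at its parent, so answering "have you tried X?" reduces to diffing hyperparameters along the chain rather than asking around the team:

```python
# Hypothetical lineage store: each experiment records its parent and
# its hyperparameters, so the delta between runs is a lookup.
experiments = {}

def log_experiment(exp_id, parent_id, params, metric):
    experiments[exp_id] = {"parent": parent_id, "params": params, "metric": metric}

def diff_from_parent(exp_id):
    """Return the hyperparameters this experiment changed relative to its parent."""
    exp = experiments[exp_id]
    if exp["parent"] is None:
        return exp["params"]
    parent = experiments[exp["parent"]]
    return {k: v for k, v in exp["params"].items() if parent["params"].get(k) != v}

log_experiment("exp-1", None, {"lr": 0.1, "layers": 2}, metric=0.88)
log_experiment("exp-2", "exp-1", {"lr": 0.01, "layers": 2}, metric=0.91)
print(diff_from_parent("exp-2"))  # {'lr': 0.01}
```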
Automated deployment
Once a successful model has been created and validated by the team, it needs to be deployed. Model deployment is one of the most difficult problems in MLOps. Everything from form factor to deliverable to monitoring solution is different from product to product, due to differences in the business requirements the deployments need to satisfy.
Furthermore, for successful model development projects, model deployment doesn’t happen just once. Every time a fundamental improvement is made to model performance, and every time the model goes out of date, the model will need to be retrained and redeployed into production.
Business leaders can help set the deployment process up for success by establishing clear expectations about model performance on the task at hand. To give a concrete example, imagine a business process with a heuristic that emails users offering coupons for a certain company product, with a historical claim rate of 1% of all users emailed. An ML model would be a candidate to replace this process if it can show, based on historical redemption data, that it can reliably achieve a claim rate of >1%.
In this example, it is the job of the business leaders looped into this process to communicate the business goal to the product team, and it is the job of the product team to transform this into the model metric threshold (cross-entropy loss, hinge loss, etc.) needed to achieve it, and then communicate that to the engineering team.
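The coupon example above can be sketched as a simple candidacy check. The numbers and function names here are illustrative, assuming a held-out set of historical redemption data to estimate the model’s claim rate from:

```python
# Candidacy check for the coupon example: the model replaces the
# heuristic only if its estimated claim rate beats the 1% baseline.
BASELINE_CLAIM_RATE = 0.01  # the heuristic's historical claim rate: 1%

def estimated_claim_rate(redeemed: int, emailed: int) -> float:
    # Estimated from held-out historical redemption data (illustrative).
    return redeemed / emailed

def is_candidate(model_rate: float, baseline: float = BASELINE_CLAIM_RATE) -> bool:
    return model_rate > baseline

model_rate = estimated_claim_rate(redeemed=156, emailed=10_000)  # 0.0156
print(is_candidate(model_rate))  # True
```

In practice the product team would also fold in a confidence margin on that estimate, since "reliably" beating the baseline matters more than beating it on a single backtest.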
As the project evolves and matures, specific processes are needed to put performant models into production. This process will likely involve human-in-the-loop review of model performance metrics by product stakeholders and engineering briefs for business stakeholders who need to rubber-stamp the deployment. Implementing unified experiment management and automated training and comparison will set you up for success here, speeding up human review processes by putting the necessary information at the presenters’ fingertips.
Teams with clear, high-confidence business objectives can go one step further, eliminating the need for human involvement entirely by automatically promoting models that satisfy certain criteria into production right away. This expedited path to production is of great value to engineering teams, but it requires a high degree of confidence in the deployment process, making this strategy a hallmark of a mature data science organization.
The same pipeline tools that can be used to automate model training and comparison can be used to automate deployment, too. Testing and deployment are two sides of the same coin: the same CI/CD principles that apply to DevOps pipeline tools like Jenkins apply to MLOps pipeline tools like Spell workflows, too.
For your deployment platform, choose a form factor that is suitable for your business needs (model API, edge deployment, etc.). That decision then informs your choice of platform. You want to choose a platform that satisfies your chosen deployment form factor while offering attachment points for other services (like model monitoring) and abstracting away low-level details that the platform can solve for you.
For example, for models deployed as an API endpoint, Spell model servers offer Kubernetes-backed resilience and autoscaling with a simple, easy-to-use form factor: a simple Python class file and some command-line instructions. For models deployed to mobile, Fritz AI seamlessly handles multi-device deployment (iOS, Android) for you.
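To illustrate the form factor, here is a minimal sketch of the kind of Python class file an API model server might wrap. The class shape and names are hypothetical, chosen to show the idea, and are not Spell’s (or any platform’s) actual interface:

```python
# Hypothetical predictor class file for an API model server. A real
# platform would define its own base class and loading conventions.
class Predictor:
    def __init__(self):
        # Load model weights once at server startup; a constant stands in here.
        self.threshold = 0.5

    def predict(self, payload: dict) -> dict:
        # A real server would run the model; here the score is passed through.
        score = payload["score"]
        return {"label": int(score >= self.threshold), "score": score}

predictor = Predictor()
print(predictor.predict({"score": 0.72}))  # {'label': 1, 'score': 0.72}
```

The platform’s job is everything around this class: request routing, batching, autoscaling, and health checks, so the model-builder only writes the load-and-predict logic.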
To learn more about the model serving form factors available to you and your team, and tips on how to implement them effectively, check out my previous article: "MLOps concepts for busy engineers: model servers".
Automated monitoring
Productionizing ML doesn’t end at deployment. Production models need to come with extensive logging and monitoring attached. This is again a familiar need from the DevOps world—which stresses monitoring service health—that takes on even greater importance in MLOps.
One reason why is the problem of concept drift. Concept drift is the phenomenon whereby a machine learning model, left to its own devices, eventually falls out of step with its inputs, serving increasingly bad or inaccurate responses. This is because model performance slowly degrades over time, even when left untouched, due to natural changes in the data stream.
The exact nature of this degradation, and the rate at which it occurs, is highly domain-, problem-, and data-specific. For example, changes in user behavior might degrade model performance slowly, over long periods of time. Sudden changes in the data provided by an upstream vendor, meanwhile, might change model performance drastically, and all at once.
Guarding against concept drift (and more mundane problems like hardware performance regressions) requires a well-developed monitoring solution. There are a large variety of tools, like Grafana, and techniques, like health checks and canary deployments, that can help here.
Business leaders hoping to guide products using machine learning models to and through successful deployments can help by (1) being conscious of the need to implement a model monitoring solution as part of the scope of work of the project and (2) working with the product team to define performance and response time characteristics (so-called service-level objectives, or SLOs) that the service will need to maintain.
For example, a model endpoint that serves shopping recommendations to live users on the company’s website might have the following SLOs:
- The 95th percentile response time does not exceed 2 seconds.
- Rolling model accuracy on the most recent 1000 user recommendations served is never below 95% of test set accuracy.
The product and engineering teams can then work together to define what alerting mechanisms need to be put into place to try to meet these objectives on an ongoing basis. For example, the engineering team might configure the system to send out a PagerDuty alert whenever rolling model accuracy dips below 97% of test set accuracy, or automatically roll back the model when it dips below 95%.
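The alerting rule just described can be sketched as a rolling-accuracy monitor. The window size and thresholds mirror the example above; the alert actions are stand-in strings where a real system would page an engineer or trigger a rollback:

```python
# Rolling-accuracy monitor for the SLO example: page below 97% of
# test-set accuracy, roll back below 95%. Values are illustrative.
from collections import deque

class RollingAccuracyMonitor:
    def __init__(self, test_accuracy: float = 0.90, window: int = 1000):
        self.test_accuracy = test_accuracy
        self.outcomes = deque(maxlen=window)  # most recent N outcomes

    def record(self, correct: bool) -> str:
        self.outcomes.append(correct)
        rolling = sum(self.outcomes) / len(self.outcomes)
        ratio = rolling / self.test_accuracy
        if ratio < 0.95:
            return "rollback"  # automatically roll the model back
        if ratio < 0.97:
            return "page"      # send a PagerDuty alert
        return "ok"

monitor = RollingAccuracyMonitor(window=100)
for correct in [True] * 87 + [False] * 13:
    status = monitor.record(correct)
print(status)  # page  (87% rolling accuracy is 96.7% of the 90% test accuracy)
```

In production, the same check would typically live in a metrics pipeline (e.g. feeding Grafana dashboards), with user-feedback or delayed labels supplying the `correct` signal.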
The precise configuration and nature of these checks is something the engineering team needs to decide for themselves, and having clear, well-defined service performance objectives goes a long way towards enabling them to do just that.
MLOps is an emerging set of needs for organizations building machine learning models and incorporating them into their product offerings. Much like DevOps before it, MLOps is coming into vogue largely due to the increasing awareness that model development and deployment are intrinsically linked. These processes need to be treated as connected both in the internal engineering process and in the larger business context.
Business and product leaders can help set their teams up for MLOps success by establishing clear MLOps strategies for their product offerings and choosing a platform (or platforms) that addresses these needs and asks. Unified experiment management, automated training and comparison, automated deployment, and automated monitoring are all critical components of the machine learning product lifecycle—all of which require clear project objectives from business leaders and product stakeholders responsible for bringing these products to end users.