MLOps concepts for busy engineers: model serving

The final step in a successful machine learning model development project is deploying that model into a production setting. Model deployment is arguably the hardest part of MLOps (see "Machine Learning: The High-Interest Credit Card of Technical Debt"), and it is also probably the least well-understood.

Model training is relatively formulaic, with a well-known set of tools (PyTorch, scikit-learn, XGBoost, etc.) and strategies for using them. Model deployment is the exact opposite. Your choice of deployment strategy and infrastructure is inextricably tied to user expectations, business rules, and the technologies you are already using at your organization. As a result, no two model deployments are the same.

In this article I will focus on the likely first model deployment decision you will have to make: your model serving strategy. Broadly speaking, there are four different common model serving form factors:

  • Offline serving (aka batch inference)
  • Model as a service
  • Online model as a service (aka online or incremental learning)
  • Edge deployment

This article is a quick introduction to these strategies for technical readers new to MLOps.

This list is not exhaustive; it merely covers the strategies that should be familiar to most machine learning engineers. In future articles, we will dig into these strategies in more detail and showcase how they would work on our end-to-end machine learning platform, Spell.

Offline serving

In the offline serving paradigm, the end user is not exposed to your model directly. Instead, the model is scored on your records ahead of time (data scientists usually refer to this process as "batch inference"), and those results are in turn served to your end user.

Offline serving is exceedingly simple to implement. This is its greatest advantage. It is quite easy to run a batch inference job on your test dataset, cache the results to a high-performance database (like Redis or memcached), and then serve results from that database to end users. This is, essentially, how the Netflix recommendation algorithm works.

If your end users are internal to your organization, this is even easier. In this case, you don’t need a database: you can just write the results to some S3 objects (or similar) and point users to those files for their needs. This works for applications in which long lead times are acceptable, e.g. scoring users for credit card pre-approvals.
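As a minimal sketch of this pattern, assuming a scikit-learn model: a batch of records is scored up front and the results are cached by user ID. A plain dict stands in for Redis, and the user IDs and features are synthetic.

```python
# Offline serving sketch: score all records ahead of time, cache results.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train a toy model (stand-in for your real trained artifact).
X_train = rng.normal(size=(100, 4))
y_train = rng.integers(0, 2, size=100)
model = LogisticRegression().fit(X_train, y_train)

# Batch inference over the full user base.
user_ids = [f"user_{i}" for i in range(10)]
X_users = rng.normal(size=(10, 4))
scores = model.predict_proba(X_users)[:, 1]

# "Serving" is now just a cache lookup; a real deployment would write
# these to Redis/memcached (or S3 files) instead of an in-memory dict.
cache = {uid: float(s) for uid, s in zip(user_ids, scores)}
print(cache["user_0"])
```

A request handler in this paradigm never touches the model at all; it only reads from the cache.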

The greatest disadvantage of offline scoring is the fact that it is a so-called cold deployment. The data that you will score on needs to be available ahead of time, and your end users need to be accepting of long lead times. As a result, offline scoring typically only works for "push" workflows — cases where results are pushed from the model server to the end user, who never makes requests of their own.

It is sometimes possible to use offline serving for "pull" workflows — cases where the end user makes requests to the model server. This is tricky to do because end users typically have expectations about response times (e.g. "I want this API request to return a response in 5 seconds or less"), which an offline model (by its very nature) cannot meet. With some applications, it is possible to cache every possible model response ahead of time. If that works for your use case, offline serving is a possibility. Netflix recommendations are again a good example of a product that uses this strategy.

The biggest challenge to scaling an offline serving strategy is scaling scoring on the test dataset. At this time, for greenfield projects without preexisting hard dependencies, I highly recommend using Dask, potentially accelerated using NVIDIA RAPIDS, for this. Check out my previous article, “Getting oriented in the RAPIDS distributed ML ecosystem”, to learn why.
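The partition-and-score pattern that Dask implements can be sketched with the standard library alone. In this illustrative stand-in, a thread pool and a toy sum-based scorer replace the real cluster and model; Dask's `map_partitions` generalizes exactly this shape across many machines.

```python
# Partitioned batch scoring sketch: split the dataset into chunks and
# score each chunk in parallel. A ThreadPoolExecutor stands in for a
# Dask cluster; score_partition stands in for model.predict.
from concurrent.futures import ThreadPoolExecutor

def score_partition(rows):
    # Toy "model": score each row as the sum of its features.
    return [sum(r) for r in rows]

data = [[i, i + 1] for i in range(100)]
partitions = [data[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves partition order, so results line up with inputs.
    scored = [s for part in pool.map(score_partition, partitions) for s in part]

print(scored[:3])  # [1, 3, 5]
```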

Model as a service

Deploying a model as a (micro)service is the most common model serving strategy in production settings. In this paradigm, an interface to the model is exposed to clients as an API endpoint (REST or otherwise). Clients make POST or GET requests against the endpoint to get what they need.

This is a flexible deployment strategy for building responsive, scalable model services, one which closely mirrors the deployment strategy (and technologies) you already use for deploying your existing software stack. As a result, most software engineers are already intimately familiar with this deployment strategy.

There is growing consensus in the MLOps community that Kubernetes is the deployment target of choice for model deployments using this strategy. Kubernetes solves a whole host of problems for you:

  • It can scale your service up and down (launching and spinning down model server pods as needed) automatically based on demand.
  • It allows for zero-downtime model upgrades (using rolling deployments).
  • It provides resilience — model server pods fail over and restart automatically.

Additionally, your DevOps team is probably already using Kubernetes in at least some parts of your stack, so there is already a wellspring of knowledge about managing such a service that you can borrow from and use.

However, Kubernetes is also extremely complicated. The most popular low-level ML-on-Kubernetes toolkit, Kubeflow, deploys 32 different services out of the box.

If you are an extremely large enterprise, or the model is critical to your line-of-business, it might make sense to eat that complexity cost. For everyone else, there are a variety of Kubernetes-backed model server SDKs that abstract away most of the low-level details.

For example, Spell’s model server feature lets you deploy models to production using a simple Python class with __init__ and predict methods, a model artifact, and a spell server serve console command.
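A hypothetical sketch of the kind of predictor class such SDKs expect — the class name, payload format, and fallback model below are illustrative assumptions, not Spell's actual interface: load the artifact once in `__init__`, handle one request per `predict` call.

```python
# Sketch of a model-server predictor class: the SDK instantiates it once
# and calls predict() per request.
import json
import pickle  # pickle stands in for whatever artifact format you use

class Predictor:
    def __init__(self, model_path=None):
        # A real deployment would load a trained artifact from disk; here
        # we fall back to a trivial stand-in model that sums features.
        if model_path:
            with open(model_path, "rb") as f:
                self.model = pickle.load(f)
        else:
            self.model = lambda xs: [sum(x) for x in xs]

    def predict(self, payload):
        # payload: JSON string like '{"instances": [[1.0, 2.0]]}'
        instances = json.loads(payload)["instances"]
        return {"predictions": self.model(instances)}

p = Predictor()
print(p.predict('{"instances": [[1.0, 2.0], [3.0, 4.0]]}'))
```

Whatever serves HTTP in front of this class (Flask, a Kubernetes ingress, a managed SDK) only needs to forward request bodies to `predict`.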

Online model as a service

For most use cases, deployment in an offline or a model-as-a-service mode is sufficient. However, if you want your model endpoint to be a little bit more ✨magical✨, online model-as-a-service deployment is an option.

A model is said to be online if it learns from user input automatically. The canonical example of online machine learning is a neural network which trains on a batch size of 1. Each time a user makes a request to your model endpoint, a full backpropagation pass is kicked off based on that input, updating the model weights simultaneously (or asynchronously) with serving the request.

This technique is most commonly associated with deep learning, although it can also be done using some "classical" ML methods, using e.g. the partial_fit API in scikit-learn.
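For instance, a minimal online-learning loop using scikit-learn's `partial_fit` — the stream of incoming requests and their labels is simulated here with random data whose label is simply the sign of the first feature:

```python
# Online learning sketch: the model's weights are updated one example at
# a time, as "requests" arrive, via scikit-learn's partial_fit API.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all classes up front

rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=(1, 3))           # one incoming "request"
    y = np.array([int(x[0, 0] > 0)])      # its observed label
    model.partial_fit(x, y, classes=classes)  # incremental weight update

# After 200 single-example updates the model has learned the boundary.
print(model.predict(np.array([[10.0, 0.0, 0.0]])))
```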

One advantage of online learning is that it mostly eliminates the need for retraining.

Model services deployed using a model with conventional fixed weights will gradually go out of date, due to changes in user behavior and/or in the data stream itself. This problem is known as concept drift, and it necessitates occasional retraining and redeployment of your models. The frequency with which model retraining needs to be performed is extremely problem-specific. A credit card scoring model might remain accurate for months. On the other hand, because of the cold start problem, the Netflix recommender algorithm is retrained every few minutes.

Because online models learn as they go, they greatly reduce the cadence with which redeployment is required. Instead of adapting to concept drift at deployment time, the model adapts to concept drift at inference time, improving the perceived performance of the model for end users.

Another advantage is that online algorithms are perceived to be more responsive by end users. A user that provides model inputs in an area the model doesn’t know very well will very quickly see the model's performance on their particular subset improve, as online learning kicks in and rectifies the algorithm’s performance in that region of the problem space. This is particularly valuable in domains where the range of possible user inputs is extremely large, making the model's experience with any particular region relatively shallow.

The disadvantage of online learning is complexity. To deploy a model server online you need to solve many !!FUN!! new problems:

  • Online algorithms are susceptible to catastrophic forgetting. If the model does not see enough inputs in part of its domain, it may begin to forget what it learned there, causing long-term performance regression.
  • Online algorithms are susceptible to adversarial attacks. While no widely publicized such attack on a production service has occurred yet, in my opinion, given what’s been demonstrated in the academic literature, it’s only a matter of time.
  • Online algorithms are hard to train. The smaller the batch size, the more unstable the learning. Online models take much longer to converge, and require much more aggressive use of regularization, with negative implications for model performance.
  • Initialization becomes more complicated. Any new model server instances that spin up will need to read their weights from a peer, or from a recent backup, instead of from a static weights file.
  • If your model server is scaled to multiple instances, the different model copies will slowly go out of sync, requiring occasional re-synchronization of weights.
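On that last point, one simple (hypothetical) re-synchronization strategy is to periodically average the replicas' weights; this sketch models each replica as a dict holding a NumPy weight vector.

```python
# Weight re-synchronization sketch for replicated online models:
# periodically replace every replica's weights with the fleet average.
import numpy as np

def resync(replicas):
    """Average all replicas' weight vectors and write the mean back."""
    mean_w = np.mean([r["w"] for r in replicas], axis=0)
    for r in replicas:
        r["w"] = mean_w.copy()
    return replicas

# Two replicas that have drifted apart after serving different traffic.
replicas = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
resync(replicas)
print(replicas[0]["w"])  # [2. 3.]
```

Real systems (e.g. parameter servers) use more sophisticated schemes, but the averaging step above captures the basic idea.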

These are some very hard problems to solve. As a result, online model deployments are rarely seen in production. Getting one right requires a ton of expertise and fine-tuning, on both the data science and data engineering sides, and is probably only worth it for critical line-of-business applications with high monetary impact for the company.

Edge deployment

The next and final category of model serving strategies I will cover here is edge deployment.

Every model serving strategy we’ve outlined so far is based on the client-server architecture. An intrinsic property of this design is the need to move data from the client to the server and back again. This creates an attack vector: if network security is compromised, user data can be intercepted and tampered with in flight. A client-server design is also unusable when the client is offline.

One way to solve these problems is to move away from using a server completely, and instead serve the model right on the client device. This is called edge deployment because it moves the computation off the server and onto the edge (the client device).

Edge deployment is tricky because the hardware available to the client (a web browser, a user computer, or a user mobile device) is extremely limited. However, support is advancing rapidly on both the hardware and software fronts. Technologies like Apple’s Core ML and Google’s TensorFlow.js provide high-level, first-party SDKs for running machine learning model inference across client platforms. On the hardware front, modern desktop GPUs and CPUs and mobile SoCs ship with machine learning inference features (like tensor cores) burned right into the chip.
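Much of what makes edge deployment feasible is shrinking the model to fit constrained hardware. As an illustration, here is a minimal NumPy sketch of symmetric int8 weight quantization — real toolchains like Core ML Tools or TensorFlow Lite do this for you, with far more sophistication.

```python
# Symmetric int8 quantization sketch: map float32 weights into [-127, 127]
# so the model takes ~4x less space and can use int8 hardware paths.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0      # one scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q.tolist())  # [50, -127, 1]
```

The round trip loses at most half a quantization step per weight, which is usually an acceptable accuracy trade for the size reduction.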

High-level SDKs like Fritz AI target these platforms, helping to ameliorate one of the traditional pain points of edge deployment: multi-device deployment.

For model serving scenarios where client-side deployment makes sense, with models which are moderate in size, and for applications which are not highly response time sensitive (inference on an edge device will take much longer than inference on a GPU server), edge deployment is a strategy worth considering. Plus, it’s fun!

A couple of additional things you will want to be aware of if you choose to go this route:

  • Since model updates now mean shipping a file to a heterogeneous network of devices, you will need to think carefully about your deployment strategy.
  • You will need a policy on minimal device support, and a strategy for client devices that fail that test.
  • Hybrid models are a thing. For example, Google Home Voice will use a model-as-a-service endpoint if it is connected to WiFi, but will switch to on-device inference when offline.
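The dispatch logic of such a hybrid client might be structured roughly like this sketch, where both models and the connectivity flag are stand-ins.

```python
# Hybrid serving sketch: prefer the remote model-as-a-service endpoint
# when connected, fall back to a smaller on-device model when offline.
def remote_model(x):
    return x * 2.0        # stand-in for a full-accuracy server-side model

def local_model(x):
    return round(x) * 2   # stand-in for a cheaper quantized on-device model

def predict(x, online):
    # A real client would also handle timeouts, retries, and stale results.
    return remote_model(x) if online else local_model(x)

print(predict(1.3, online=True), predict(1.3, online=False))
```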


In this article we covered the four machine learning model serving form factors that every machine learning engineer should know. Needless to say, there are a lot of things to consider!

Again, I will note that this article is not exhaustive. For example, one deployment strategy omitted here is federated serving: an interesting strategy involving models trained in piecemeal fashion across many devices. This has notable security and privacy implications, but is still mostly a research curio.

Stay tuned for future articles discussing how these deployment strategies work on Spell. In the meantime, if you enjoyed this article, check out the other articles on the Spell blog.
