What does the machine learning workspace of the future look like?

Every data scientist eventually reaches the point where the quick-and-dirty, works-on-my-machine workflows that suffice for simple side projects start to fail in the face of increasing model complexity. In my experience, every data science team eventually realizes it needs team-wide model tools and processes.

Having worked in and around the machine learning tooling space for a couple of years, I think it’s really important to invest in a well-thought-out model experimentation workspace before you think you need it. And you should definitely do so before you start building machine learning products stitched together by buggy Flask apps and one-off Jupyter widgets! In particular, I think that there are five essential features of a comprehensive model experimentation environment that every team training machine learning models should have:

  1. Cloud notebooks for evaluating models;
  2. Cloud scripts for training them;
  3. Model monitoring for tracking them;
  4. Model lineage for communicating them;
  5. And model servers for demoing them.

In this blog post I’ll discuss what these tools are and why you need them, and show how I incorporate them into my own workflow.

Cloud notebooks for evaluating models

No matter what the TensorFlow marketing materials try to tell you, there is no such thing as an "off-the-shelf" neural network model. Neural networks are highly complex and ridiculously fiddly: they can fail in innumerable different ways, almost all of them really hard to debug. Andrej Karpathy said it best:

A "fast and furious" approach to training neural networks does not work and only leads to suffering. Now, suffering is a perfectly natural part of getting a neural network to work well, but it can be mitigated by being thorough, defensive, paranoid, and obsessed with visualizations of basically every possible thing.

Jupyter notebooks are the standard for the model evaluation and hypothesis generation part of your workflow. Notebooks are just the most convenient and easy-to-use coding environment we know of.

However, not everyone has made the leap yet to the next logical step: running Jupyter notebooks on the cloud. This has two big advantages:

  • Flexibility. You’re no longer limited by or tethered to your local machine. You can scale the compute backing your notebook up or down as needed, including launching notebook instances on machines with eight GPUs onboard, or launching them on multiple different machines. My personal favorite is abusing high-RAM instances for those brute force for-loop-on-entire-dataset-at-once jobs.
  • Shareability. Cloud notebook products make it easy to share notebooks with the rest of your team by e.g. sharing a link. No intermediate steps (e.g. pushing to git) required.

Cloud notebooks do have one disadvantage: cost. Running notebooks on the cloud is more expensive than running notebooks on your local machine. At the time of writing, the cheapest GPU instance available on AWS, a T4 (g4dn.xlarge), costs $0.526/hour. The big cloud providers have not been able to get the cost of GPU instances down to the ridiculously low levels we’re used to for CPU instances. As corporate expenses go, however, it’s not that big of a deal — 40 hours (one work week) of time on a T4 is still just $21. That’s about the cost of a lunch break in Manhattan — and you can go even lower using spot instances.
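As a quick sanity check on those numbers (the $0.526/hour figure is the on-demand T4 price quoted above; actual prices vary by region and change over time):

```python
# Back-of-the-envelope GPU cost estimate. The hourly rate below is the
# on-demand price for a g4dn.xlarge (T4) quoted in the text; check the
# AWS pricing page for current numbers in your region.
T4_HOURLY_USD = 0.526

def weekly_cost(hourly_rate: float, hours: float = 40.0) -> float:
    """Cost of running an instance for one work week of compute."""
    return hourly_rate * hours

print(f"${weekly_cost(T4_HOURLY_USD):.2f}")  # about $21 for a work week
```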

As far as features go, there are a few things I consider important:

  • Easy scale-up and scale-down. It should be easy to switch to a different compute instance, or edit the files and packages that get loaded in, without having to create an entirely new instance.
  • Sensible default environments. You should have access to a base environment with the usual support packages ready to go. A PyTorch environment should have torchvision; a TensorFlow environment should have tensorboard.
  • Local-remote parity. Training in the cloud environment should differ as little as possible from training on your local machine.

Cloud scripts for training models

Training non-trivial machine learning models inside of Jupyter notebooks is an anti-pattern.

Jupyter notebooks have two major weaknesses: the fact that they have a ton of hidden state, and the fact that they are intended to be run in an editable execution environment.

Because you can run notebook cells in any order, and then proceed to delete, recreate, and rearrange cells as much as you’d like, the state of a notebook you’re halfway through editing is a complete unknown. You’ll be relying on variables defined in cells you’ve since deleted, or using libraries imported three cells below where you use them. If you try running a notebook you’ve been editing for a while from the beginning again, it’s almost guaranteed that something will break.

The problem is bad enough that it formed the basis for the most-discussed talk of JupyterCon 2018: “I Don’t Like Notebooks”.

To avoid breakage, the best practice for training a model inside of a Jupyter notebook is to restart the kernel and rerun the entire notebook from scratch. If you do this, it becomes easiest to just isolate all of the model code in a single cell block, so you only have to run that one cell to restart or debug model training. At this point, you’ve basically reinvented Python scripts.

Meanwhile, updating or modifying your notebook environment risks breaking all of your previous model training code by switching you to incompatible versions of your dependencies. The more complex the model, the greater the risk that future-you will have to debug C pointer exceptions in libraries you’ve never heard of. Certain types of errors can even blow up your Python kernel — good luck debugging that.

Jupyter notebooks are a great environment for model evaluation, but for actually training your models, use scripts instead. Personally, I use the following workflow:

  1. Prototype a new version of the model in a Jupyter notebook running on a cheap GPU instance.
  2. Isolate the model training code to a single cell block. Restart the notebook kernel and re-execute just this cell. To ensure code correctness, I like to train the model for at least one full epoch.
  3. Copy the model code to a Python script (you may need to make some minor adjustments so that the code works when run out of a script). The %%writefile magic command lets me do this right in the notebook.
    You can name the file if you’d like, but personally I find it easiest to just assign each model an ascending sequence number. See this Gist for a quick example.
  4. Execute the model training script from the notebook. Using Spell:
    spell run --machine-type V100x2 --pip 'torch>=1.0.0' --pip scikit-image --pip torchvision --mount uploads/bob_ross_segmented_v4:/spell/bob_ross_segmented 'cd models; python model_12.py' .
  5. Go do something else.
  6. Once the job is finished, download the model artifact to my notebook environment with e.g. spell cp runs/62/checkpoints/ /tmp/checkpoints.
  7. Evaluate the model results. Prototype a new version of the model and restart the process all over again.

The notebooks in the spellml/paint-with-ml GitHub repository demonstrate this workflow in action.

Model monitoring for tracking models

Once you’ve started a long-running machine learning training job, it’s important to keep tabs on it. In a free environment like Google Colab you don’t necessarily care that much, but when you’re using compute that you’re paying for, the ability to manually terminate training is extremely important, lest you accidentally leave your expensive training job running overnight (Andrej Karpathy has a story about accidentally training a model to state-of-the-art performance over holiday break).

This form of inline model monitoring requires two kinds of tracking. You need hardware tracking to ensure that your model is using the compute resources allotted to it correctly, e.g. that it’s saturating all of the available GPU and CPU resources. And you need model metrics tracking to ensure that the model loss is actually improving, i.e. that you’re not just burning compute cycles without improving the model.
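If your platform doesn’t surface hardware metrics for you, you can get a rough inline view by polling nvidia-smi yourself. A minimal sketch — the query flags are standard nvidia-smi options, and the parsing assumes its CSV output format:

```python
import subprocess
from typing import Optional

def gpu_utilization(csv_output: Optional[str] = None) -> list:
    """Poll nvidia-smi for per-GPU utilization and memory use.

    Pass csv_output to parse pre-captured output instead of shelling out
    (handy on a machine without a GPU).
    """
    if csv_output is None:
        csv_output = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    stats = []
    for line in csv_output.strip().splitlines():
        util, used, total = (float(v) for v in line.split(","))
        stats.append({"util_pct": util,
                      "mem_used_mib": used,
                      "mem_total_mib": total})
    return stats
```

Calling this in a loop (or from a notebook cell) gives you a quick answer to "is my GPU actually busy?" without leaving your workflow.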

Here’s how this looks on Spell: hardware utilization metrics are available on the run summary page, and model metrics are available via an integration with TensorBoard.

Your baseline for model monitoring should be real-time tracking of your model losses. Model loss is the attribute of your model training history which most informs the adjustments you could make to improve model accuracy. However, as the model grows in size and complexity, there is more and more side-channel information that you will want to track. If you are training a GAN model to synthesize images, you’ll almost certainly want to include side-by-side comparisons of real and fake images in your logs. If you are concerned with a classification model’s tendency to confuse certain pairs of classes, you will want to start logging model performance on those classes. And so on.
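With TensorBoard this is an `add_scalar` or `add_image` call; here’s a framework-agnostic sketch of the same idea, logging the loss plus arbitrary side-channel metrics as JSON lines. The file format and class name here are my own convention, not any particular tool’s:

```python
import json
import time
from pathlib import Path

class RunLogger:
    """Append-only metrics log: one JSON object per training step.

    New side-channel metrics are just extra keyword arguments -- no
    schema change needed when the team adds a new visualization."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, step: int, **metrics) -> None:
        record = {"step": step, "time": time.time(), **metrics}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

# logger = RunLogger("runs/model_12/metrics.jsonl")
# logger.log(step=100, loss=0.41)
# logger.log(step=200, loss=0.35, cat_dog_confusion=0.07)
```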

The most important feature for your model monitoring tool is flexibility. It needs to be easy to add new model evaluation components to your monitoring dashboard. The velocity of a team working on a model is such that new evaluation components will need to be added on a roughly weekly basis. This is why model monitoring tools like TensorBoard, Weights & Biases, and Streamlit are so great: they provide a suite of visualization tools which are easily interchangeable, mostly error-proof, and (best of all, from the perspective of a data scientist) native to Python. This is a huge step up from writing your own custom model monitor in e.g. Flask, and having to constantly rewrite and refactor it as needs evolve and new visualization types are required.

Model lineage for communicating models

One attribute of the machine learning training process which is sometimes overlooked is the need for exposition. Models are not built in isolation: they require constant back-and-forth communication within the team building them. This is especially true at the end of a project, when the time comes to present the results to stakeholders.

Supporting this need requires incorporating easy access to your model lineage — the sequence of models you trained over the course of the project — into your best practices. For example, training our GauGAN model to a well-fitted result required 15 different model training runs.

How discoverable is this history for the rest of your team? It should be easy for teammates who have never seen your model before to pull up your model training history and understand the sequence of steps you took to get from your initial dumb baseline to your current best-fit result, preferably without requiring any input from you. Spell provides a couple of features targeting this need:

  • Model training runs can be assigned labels, which enables searching through model history by label.
  • Model training runs can be assigned free-form notes explaining what the model is and how it differs from previous ones.

Other tools provide even more extensive support for this need. In CometML, for example, model training runs are grouped together under a workspace, which provides handy summary tables and visualizations for understanding model performance over time at a glance.
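You can approximate the labels-and-notes pattern even without platform support by keeping a machine-readable run index alongside your scripts. A minimal sketch — the record fields and example runs below are illustrative, not any platform’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    run_id: int
    script: str                 # e.g. "model_12.py"
    labels: list = field(default_factory=list)
    notes: str = ""             # free-form: what changed vs. the previous run

def find_by_label(history, label):
    """Search the model lineage by label, like filtering runs in a UI."""
    return [r for r in history if label in r.labels]

history = [
    RunRecord(11, "model_11.py", ["baseline"], "dumb baseline, no augmentation"),
    RunRecord(12, "model_12.py", ["augmentation", "best"], "added random flips"),
]
# find_by_label(history, "best") returns just the run tagged "best"
```

Even this much gives a teammate who has never seen the project a searchable trail from the initial baseline to the current best result.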

Model servers for demoing models

The fifth and final need for a data science team building machine learning models is a model server — a tool you use to deploy a model to a web endpoint.

Model serving in general is a highly complicated topic, one that sits at the intersection of the data science teams concerned with training models and the data engineering teams concerned with deploying them to production. However, as a data science team you don’t need fancy features like canary releases, A/B testing, or liveness probes right away. You need a way of deploying your model internally to the rest of your team, something that allows other members of your team to evaluate your results for themselves.

The needs for such experimental model deployment are vastly different from those of a robust production deployment. Deploying the model should be as simple as possible, performance is irrelevant, and monitoring can be basic to non-existent. Cross those bridges when you get there.
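To make "as simple as possible" concrete, here’s what a bare-bones internal model endpoint can look like using only the Python standard library. The predict function is a stand-in for a real model, and the route and payload shape are my own invention — a production server would need far more than this:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a real model -- swap in your checkpoint's forward pass."""
    return {"score": sum(features) / max(len(features), 1)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, and echo back JSON.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        result = predict(json.loads(body)["features"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep demo output quiet

# To serve: HTTPServer(("", 8000), PredictHandler).serve_forever()
```

That is roughly the level of ceremony an internal hey-try-out-this-model deployment deserves; a managed model server gets you the same thing without even this much code.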

Model serving tools are in the middle of a renaissance right now, with new platforms touting this feature coming out all the time (Cortex and Kubeflow are two prominent examples). As it so happens, we (Spell) are releasing our own model serving product just this week 🙂 — you can play with a demo webapp showcasing it at https://paintwithml.spell.ml.

These platforms differ in feature-set, but not in implementation: pretty much every model serving API I’ve seen is a frontend to a Kubernetes cluster running on the cloud.

For one thing, Kubernetes won the compute orchestration market. For another, Kubernetes gives you obvious insertion points for production serving features (the aforementioned canary releases, A/B testing, liveness probes) when you need them.

It’s an unfortunately common industry practice for data science teams tasked with model training to chuck their finished model code “over the fence” at wholly separate data engineering teams tasked with model productionization. This is a form of siloing: it introduces a communication barrier between teams, often necessitating hours-long meetings (what Lyft Engineering refers to as “Feature War Rooms”) to build the shared understanding necessary to get the model working.

Handing over a working Kubernetes model server instead gives the data engineering team an obvious starting point for testing the model out themselves and an organized way to go about attaching the bits and baubles necessary for production serving of the model. Data scientists get an easy-to-use and ergonomic high-level API for taking their models to the experimental hey-try-out-this-new-model-build-please phase. Data engineers get a working model artifact on the same robust computing backend they use in production. Everybody wins.


It’s an open secret among data science practitioners that the tools and best practices in the industry are still aeons behind those in software engineering. Because the industry is still relatively new, rigorous best practices are not yet fully developed or followed.

However, I believe that over the next couple of years we will see a continual push towards bringing the engineering mindset to data science. The tools that we use will continue to evolve, get better, and ultimately converge to a product shape and a set of best practices based on learnings from production use cases. In this blog post I’ve outlined my own hot take 🔥 on what this future world will look like, focusing on five things that I think will drive the future of production machine learning: cloud notebooks, cloud scripts, model monitoring, model lineage, and model servers.
