Even if you’ve never touched an easel, you’ve probably heard of Bob Ross, the legendary instructor behind the iconic 90s PBS show “The Joy of Painting”. This popular program introduced the world to his fast, easy-to-learn “wet-on-wet” painting style. Each episode of “The Joy of Painting” walked the viewer through one or two Bob Ross paintings, from start to finish, in just thirty minutes.
Bob Ross’s paintings have a soft, almost mythical texture to them, with bubbling brooks, stoic mountains, and towering fir trees. He still looms large in popular culture: The New York Times recently did a where-are-they-now exposé on his work, and those looking to learn how to paint like he did continue to congregate on fan communities like TwoInchBrush.
Inspired by his work, and by recent advances in image-to-image translation networks, I trained a GauGAN deep learning model that generates landscape paintings in Bob Ross’s style. Provide a pixel segmentation mask (this pixel is part of a tree, this one of a mountain, or a rock) as input, and receive a painted canvas as output:
You can play with the model yourself by visiting our demo web app at Paint With ML. I encourage you to try it out. In this blog post I will show you how it works!
Getting the data
The first challenge was getting the data. Because we are doing image-to-image translation we actually need two different kinds of data: the raw pixels of Bob Ross paintings, and the segmentation maps for said pixels.
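To make the two kinds of data concrete, here is a minimal Python sketch of a painting/mask pair. The toy 2x3 "painting", the four-class list, and the `one_hot` helper are all illustrative assumptions rather than the actual preprocessing used for this project; the one-hot form is shown only because image-to-image networks commonly take their label maps that way.

```python
# Hypothetical class list; the index of each name is its class ID.
CLASSES = ["sky", "mountain", "tree", "water"]

# A toy 2x3 "painting": one (R, G, B) triple per pixel.
painting = [
    [(135, 206, 235), (135, 206, 235), (90, 90, 110)],
    [(34, 85, 34), (40, 90, 160), (40, 90, 160)],
]

# The matching segmentation mask: one class ID per pixel.
mask = [
    [0, 0, 1],
    [2, 3, 3],
]


def one_hot(mask, num_classes):
    """Expand an HxW mask of class IDs into num_classes binary HxW planes."""
    return [
        [[1 if class_id == c else 0 for class_id in row] for row in mask]
        for c in range(num_classes)
    ]


planes = one_hot(mask, len(CLASSES))
for name, plane in zip(CLASSES, planes):
    print(name, plane)
```

Each plane answers a single yes/no question per pixel ("is this pixel sky?"), which is the information the annotation step has to supply by hand.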
Luckily, the paintings themselves are easy to get: an index is available in an easily downloadable form on GitHub. However, there was no publicly available corpus of Bob Ross painting segmentation masks.
So I built one myself. Each Bob Ross painting took anywhere from 30 seconds to 5 minutes to annotate. In total, I spent 14 hours annotating a subset of approximately 250 images using the Labelbox annotation tool. Here are a couple of example masks:
Getting the model
The current state-of-the-art in computational image generation is a convolutional neural network model known as GauGAN. GauGAN, a project of NVIDIA Labs, takes a semantic image mask — a set of pixel-by-pixel class labels — as input, and produces a label-matched image as output. NVIDIA has demonstrated some very impressive results with a version of this model trained on a million Flickr images:
GauGAN is an example of a generative adversarial network (GAN). In a GAN, two networks face off against one another: a generator network responsible for generating fake data, and a discriminator network responsible for distinguishing between real images and fake ones. As training proceeds, the generator network learns to generate more and more realistic-looking fake images at the same time that the discriminator network gets better and better at sniffing them out. Given enough time and data, the generator learns to generate highly realistic-looking images.
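The tug-of-war between the two networks can be sketched with the classic cross-entropy GAN losses. This is a simplified illustration of the general idea, not GauGAN's actual training objective, which is more involved and is not shown here. The function names and the example probabilities are assumptions chosen for illustration.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) toward 1."""
    return -math.log(d_fake)


# Early in training the discriminator spots fakes easily...
early = (discriminator_loss(d_real=0.9, d_fake=0.1), generator_loss(d_fake=0.1))
# ...later the generator fools it about half the time.
late = (discriminator_loss(d_real=0.5, d_fake=0.5), generator_loss(d_fake=0.5))
print(early, late)
```

Here `d_real` and `d_fake` stand for the discriminator's estimated probability that a real or generated image is real. As the generator improves, its loss falls while the discriminator's rises, which is the dynamic described above.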
Training the model
Instead of using GauGAN to generate photo-realistic landscapes we’re going to use it to generate style-realistic paintings. And instead of using millions of Flickr images, I used my comparatively tiny corpus of 250 Bob Ross images.
The first model that I trained was built from scratch on just the Bob Ross paintings. Because the dataset is so small, I did not expect great performance, but it’s important to establish a baseline model against which to compare later results. Three hours of training on a V100 later:
We can improve on the performance of this baseline model using transfer learning.
One of the seminal scene parsing benchmarks is the ADE20K dataset, a collection of 20,000 images labeled across 150 different classes. NVIDIA never released the Flickr data they used to train their photo-realistic model, making the ADE20K corpus the most comprehensive corpus of landscape images I could get my hands on.
For Bob Ross’s paintings I narrowed the classes of interest down to just nine total categories: sky, rock, mountain, grass, plant, tree, water, river, sea. ADE20K is a general-purpose dataset containing a mixture of indoor and outdoor scenes, so fully half of the images in the dataset are indoor scenes with no landscape content. Another 5,000 or so images had only marginal overlap with our labels of interest. That still leaves us with 5,000 photos containing classes of interest, 20x the size of the Bob Ross corpus!
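One way to carve out such a subset is to keep only images whose labeled pixels mostly belong to the nine classes. The sketch below is an assumption about how that filter could look, not the procedure actually used: the `min_fraction` threshold, the helper names, and the two example images are hypothetical.

```python
# The nine classes of interest named in the text.
LANDSCAPE_CLASSES = {
    "sky", "rock", "mountain", "grass", "plant", "tree", "water", "river", "sea",
}


def landscape_fraction(pixel_counts):
    """Fraction of an image's labeled pixels that fall in a landscape class.

    pixel_counts maps class name -> number of pixels with that label.
    """
    total = sum(pixel_counts.values())
    if total == 0:
        return 0.0
    kept = sum(n for name, n in pixel_counts.items() if name in LANDSCAPE_CLASSES)
    return kept / total


def is_outdoor(pixel_counts, min_fraction=0.5):
    # min_fraction is a hypothetical threshold, not the one used in the post.
    return landscape_fraction(pixel_counts) >= min_fraction


bedroom = {"wall": 6000, "bed": 3000, "plant": 1000}
lake = {"sky": 4000, "mountain": 2500, "water": 3000, "person": 500}
print(is_outdoor(bedroom), is_outdoor(lake))
```

A potted plant in a bedroom contributes only a small fraction of landscape pixels, so that image is dropped, while the lake scene is kept.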
I christened this subset of data ADE20K-Outdoors. The idea is that I would train the model from scratch on the ADE20K-Outdoors dataset, then fine-tune it on Bob Ross paintings. That way the resulting model would hopefully get all of its structural information (knowledge about things like clouds in the sky or trunks on trees) from the robust ADE20K-Outdoors dataset, and get all of its stylistic knowledge from the comparatively small but visually distinctive Bob Ross dataset.
However, I quickly ran into a problem. GANs are legendary for their hunger for compute, and GauGAN is even hungrier than most. If you tried to train this model at home on a single GPU, it would take forever to converge.
Enter Spell. Using Spell’s machine learning management SDK I was able to get my model training script running on a gigantic V100x8 GPU server in the cloud:
As you can see, training a state-of-the-art generative adversarial network on this medium-sized image dataset took half a day of compute time. Feeding this model brand new segmentation masks from the test set produces outputs like the following:
The images have an odd but pleasing effect to them: a bitmappiness that reminds me of 1990s-era video game backgrounds.
Notice that shadows are accurately reflected in the surface of the water, and that there is some pleasing variety in the colors of the mountains. On the other hand the model is unsure of what to do when it is given a large swatch of pixels having the same segmentation mask (a common problem for GANs), and struggles with making realistic-looking trees.
Finally we get to the last step: transfer learning. I froze all but the last convolutional block of the encoder and the last deconvolutional layer of the decoder, and trained the ADE20K-Outdoors GauGAN for five epochs on the Bob Ross dataset with 1/10th of the default learning rate. This allowed the model to incorporate some of Bob Ross’s signature style (his color palette especially) into the model output:
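The freezing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's training code: the layer names, the `freeze_all_but` helper, and the `DEFAULT_LR` value are hypothetical. The helper accepts `(name, param)` pairs, the shape that PyTorch's `Module.named_parameters()` yields, but uses stand-in objects here so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace


def freeze_all_but(named_params, trainable_prefixes):
    """Mark parameters trainable only if their name starts with a given prefix.

    named_params is an iterable of (name, param) pairs; each param needs a
    requires_grad flag. Returns the list of parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in named_params:
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(param)
    return trainable


# Stand-in parameters with hypothetical layer names.
params = [
    (name, SimpleNamespace(requires_grad=True))
    for name in [
        "encoder.block1.weight",
        "encoder.block4.weight",
        "decoder.up1.weight",
        "decoder.up4.weight",
    ]
]
trainable = freeze_all_but(params, ["encoder.block4.", "decoder.up4."])

DEFAULT_LR = 2e-4  # assumed default; check your own training config
fine_tune_lr = DEFAULT_LR / 10
print(len(trainable), fine_tune_lr)
```

Only the parameters returned in `trainable` would be handed to the optimizer, so the frozen layers keep what they learned from ADE20K-Outdoors while the last blocks adapt to the paintings.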
Deploying the model
I used Spell’s new model serving feature to deploy my trained model as an API endpoint. Model serving is performed on Kubernetes, and once I’d done the necessary AWS cluster configuration, getting the server up and running boiled down to running a pair of commands that look something like this:
# turn the training run into a model artifact
$ spell model create bob_ross runs/$RUNID

# turn the model artifact into an HTTP endpoint
$ spell server serve \
    --min-pods 1 \
    --max-pods 5 \
    --pip-req requirements.txt \
    paint_with_ml:v1 \
    server.py
This configures a model server on CPU that autoscales between one and five pods (model instances) as needed based on demand. Python packages to be installed prior to launch are specified by --pip-req. server.py is the model server file; this is a Python file containing a model server class that has __init__ and predict methods defined on it.
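A model server class of the shape described above might look roughly like this. It is a sketch under stated assumptions: the class name `PaintServer`, the `segmap` payload key, and the stand-in model are hypothetical, and whatever base class or registration Spell expects is deliberately left out.

```python
import base64


class PaintServer:
    """Hypothetical model server class with the two methods the text describes."""

    def __init__(self, model=None):
        # A real server would load the trained weights here, once, at startup.
        # This stand-in "model" just reports how many bytes it received.
        self.model = model or (lambda png_bytes: {"num_bytes": len(png_bytes)})

    def predict(self, payload):
        # Assumed payload shape: {"segmap": "data:image/png;base64,...."}
        header, _, encoded = payload["segmap"].partition(",")
        if not header.startswith("data:image/png;base64"):
            raise ValueError("expected a base64-encoded PNG data URI")
        png_bytes = base64.b64decode(encoded)
        return self.model(png_bytes)


server = PaintServer()
fake_png = b"\x89PNG\r\n\x1a\n"  # just the PNG signature, for illustration
uri = "data:image/png;base64," + base64.b64encode(fake_png).decode("ascii")
print(server.predict({"segmap": uri}))
```

Doing the expensive setup in `__init__` and keeping `predict` small means each pod pays the model-loading cost once rather than on every request.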
The web app
The final piece of the puzzle is the web application. Paint With ML is a simple single-page application, built in React, that enables anyone on the Internet to try the model out, “painting” their own model-assisted landscape on the web.
Users apply a combination of three different tools (brush, fill, and eraser) and nine different semantic brushes to a 512x512 HTML canvas. Clicking on the run button maps the segmentation to a base64 encoded PNG data URI, packs that into a JSON payload, and makes a POST request to the model server I stood up in the previous step. Once the client receives a response, it displays the resulting image in the panel on the right side of the screen.
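The request the client makes can be sketched as below. The real app does this in the browser with JavaScript; Python is used here only for illustration. The endpoint URL, the `segmap` JSON key, and the `build_request` helper are assumptions, not the app's actual API.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint; substitute the URL of your own model server.
ENDPOINT = "https://example.com/predict"


def build_request(png_bytes, endpoint=ENDPOINT):
    """Pack PNG bytes into a data URI inside a JSON body and build a POST request."""
    data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    body = json.dumps({"segmap": data_uri}).encode("utf-8")  # "segmap" key is assumed
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(b"\x89PNG\r\n\x1a\n")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Encoding the canvas as a data URI keeps the whole exchange in plain JSON, at the cost of roughly a third more bytes than the raw PNG.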
The web application is completely static, making it very easy to serve. We used AWS S3 Static Hosting to make it live.
This was a really fun project to work on that demonstrates the end-to-end nature of the Spell platform. Every step in the model lifecycle was done on Spell — training the model in Spell runs, saving in Spell resources, debugging in Jupyter in Spell workspaces, and serving for inference using Spell model servers.
Try it out for yourself by visiting http://paintwith.spell.ml/.