Model Servers

Concepts

Model servers allow you to deploy a trained model as a web API. Using the spell model and spell server commands, you can create a model from a set of resources and deploy the model to an HTTP endpoint.

Quickstart

This quickstart example will put up a model server which uses a CNN trained on the CIFAR10 dataset to classify images.

Create a model serving Kubernetes cluster

Spell model servers are deployed on Kubernetes, which allows them to scale based on either resource consumption or the number of requests.

AWS

The following third-party dependencies are required to set up and manage the EKS cluster used to serve your model.

  • kubectl: CLI tool for managing Kubernetes clusters
  • eksctl: CLI tool for managing EKS clusters
  • aws-iam-authenticator: Utility that allows for authenticating kubectl with EKS clusters via IAM

Once these have been installed, additional Python dependencies required by Spell must be installed.

$ pip install --upgrade 'spell[cluster-aws]'

To create the EKS cluster, run

$ spell cluster create-kube-cluster

GCP

The following third-party dependencies are required to set up and manage the GKE cluster used to serve your model.

  • kubectl: CLI tool for managing Kubernetes clusters
  • gcloud: CLI tool and SDK to interact with GCP

Once these have been installed, additional Python dependencies required by Spell must be installed.

$ pip install --upgrade 'spell[cluster-gcp]'

To create the GKE cluster, run

$ spell cluster create-kube-cluster

Create the model

To begin, clone the spell-examples repository.

$ git clone https://github.com/spellml/examples.git
$ cd examples/keras

Then, use the cifar10_cnn.py script to train a CNN and save it to SpellFS.

$ spell run \
    --machine-type k80 \
    --framework tensorflow \
    --pip keras \
    -- \
    python cifar10_cnn.py \
        --epochs 25 \
        --conv2_filter 65 \
        --conv2_kernel 7 \
        --dense_layer 621 \
        --dropout_3 0.0364

These parameters came from a moderately successful hyperparameter search. Whatever parameters you choose, the run should produce a file at runs/<RUN ID>/keras/saved_models/keras_cifar10_trained_model.h5 on SpellFS.

Finally, we can create a Spell model named cifar10, with the version name example, from the output of the run.

$ spell model create \
    --file keras/saved_models/keras_cifar10_trained_model.h5:model.h5 \
    cifar10:example runs/<RUN_ID>

The --file argument specifies that the file at keras/saved_models/keras_cifar10_trained_model.h5 should be renamed to model.h5 in this model.

Deploy the model server

The code used to load and serve this model is contained in modelservers/cifar/predictor.py in the spell-examples repository. We can create and start a model server with the cifar10 model with the spell server serve command.

$ spell server serve \
    --pip keras \
    cifar10:example predictor.py

This will create a Kubernetes deployment. Before you can run inference on the model, it needs to initialize. You can check the status of a server and find its URL using the spell server describe command.

$ spell server describe cifar10

Once the model server is running, you can test it either by issuing a cURL command against the URL given by the spell server describe command, or by using the spell server predict command. For this example, we have provided a script at modelservers/cifar/query_server.py in the spell-examples repository which takes the URL of the CIFAR model server and a path to an image, base64-encodes the image, and calls the server's predict endpoint.

$ python query_server.py http://url.to.model/server/predict path/to/image
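
If you would rather write your own client, a minimal sketch along the same lines is shown below; the "image" JSON field name and overall request format are assumptions made for illustration, so check query_server.py for the format the example server actually expects.

import base64
import sys

import requests


def query(server_url, image_path):
    # Read the image from disk and base64-encode it so it can travel in a JSON body
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")

    # POST the encoded image to the model server's predict endpoint
    response = requests.post(server_url, json={"image": encoded})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(query(sys.argv[1], sys.argv[2]))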

Models

Spell allows you to promote a set of resources in SpellFS, either produced by a run or uploaded, to be a model. At least one file in a model should be able to be loaded in Python and used to perform inference. These models can be descriptively named and have many versions, allowing you to track a model's progress over time. You can also add notes to a model to help you and your coworkers track this progress. Models are created and managed via the spell model CLI command group or the web console.

Image of the Model details page.

Create a model

Models are created from Spell resources. Once a model file has been registered to SpellFS, either as the output of a run or as an upload, it can be bundled into a model using the spell model create command.

$ spell model create mymodel runs/1

By default, Spell will assign the model an incrementing version number (e.g. v1, v2, etc.), but a custom version string can be assigned using the --version flag. Note that an auto-incremented version number is still assigned if a custom version is given, and both the custom version string and the auto-incremented version number are valid identifiers for the model version.

$ spell model create --version demo mymodel runs/1

If the resource contains files that are not relevant to the model, you can use the --file flag to specify which files should be included in the model.

$ spell model create --file out/model.hd5 --file out/context.json mymodel runs/1

Models can also be created via the web. There are three ways to get to the create flow.

  • Any completed run with outputs will have an 'Actions' dropdown on the right side of the header, which contains a 'Create Model' entry.
  • For uploads and runs, if you're navigating them in the Resource page browser, there is a '...' menu on the right side of each row which contains a 'Create Model' entry.
  • Lastly, you can simply click 'Models' in the left column and there will be a '+ CREATE MODEL' button in the top right side of the page.

List models

List all models using the spell model command. This will list the model name, latest version, and other metadata about the model.

$ spell model
NAME           LATEST VERSION    CREATED      CREATOR
cifar          v1                7 days ago   ***
bert_squad     demo (v8)         11 days ago  ***
roberta_squad  v3                7 days ago   ***

You can list all the versions of a model via spell model describe.

$ spell model describe cifar
VERSION    RESOURCE    FILES                                                  CREATED     CREATOR
v1         runs/3      saved_models/keras_cifar10_trained_model.h5:model.h5  7 days ago  ***

You can browse these via the 'Models' entry in the left column. You will be taken to a list of all models, and if you select one you will be taken to the list of versions of that model.

Delete models

Models and specific model versions can be deleted using the spell model rm command or via the '...' menu on a model's version list page.

Model servers

Model servers are applications that take a model artifact and some serving application code and wrap them in an API that can be called to run inference on the model. Servers are managed using the spell server CLI command group and can be found in the web console under 'Servers' in the left hand sidebar.

Create a model server

Model servers are created using the spell server serve command. This command takes in a model version and an 'entrypoint' path, and starts a model server.

$ spell server serve cifar:v1 predictor.py

Once started, this will create and schedule a Kubernetes deployment that hosts the model server instances. The files for the model will be available in the /model directory within the model server.

Entrypoint

The model server entrypoint defines the application code of the model server. It contains a Python class which is responsible for loading and running inference against a model. This class must inherit from spell.serving.BasePredictor, and must implement the following structure.

from spell.serving import BasePredictor

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...

The __init__ method will be run at the start of every server, and will complete before the server begins accepting incoming requests. This method should be used to load the model and perform any expensive preprocessing that can be done ahead of time, so that each prediction is as fast as possible.

The predict method will be called every time a new request is received, and should be used to run inference on the model. The JSON body of the request will be passed as a dict to the payload argument, and the return value of the method is used as the response to the request. See the predictor used in the quickstart example for a concrete demonstration of how to structure the entrypoint.
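
For instance, a predictor for the cifar10 model from the quickstart might look roughly like the sketch below. The actual modelservers/cifar/predictor.py may differ; in particular, the base64 "image" field, the preprocessing, and returning a JSON-serializable dict are assumptions made here for illustration.

import base64
import io

import numpy as np
from PIL import Image
from tensorflow import keras

from spell.serving import BasePredictor


class Predictor(BasePredictor):
    def __init__(self):
        # Model files are available under /model; model.h5 is the name chosen
        # with --file when the model version was created in the quickstart.
        self.model = keras.models.load_model("/model/model.h5")
        self.classes = [
            "airplane", "automobile", "bird", "cat", "deer",
            "dog", "frog", "horse", "ship", "truck",
        ]

    def predict(self, payload):
        # Assumes the request body is {"image": "<base64-encoded image>"}
        image_bytes = base64.b64decode(payload["image"])
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((32, 32))
        batch = np.expand_dims(np.asarray(image, dtype="float32") / 255.0, axis=0)
        scores = self.model.predict(batch)[0]
        return {"class": self.classes[int(np.argmax(scores))]}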

Note

Because model servers are multi-processed and distributed across multiple Kubernetes pods, any modifications to state in your Predictor will not be propagated to other instances of your Predictor.

Custom health checks

By default, Spell servers provide a health check endpoint at GET /health which returns the response 200 {"status": "ok"}. This health check is used by Kubernetes to determine if the model server can receive requests. This can be customized by adding a health method to your predictor class.

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        ...

    def health(self):
        ...

The health method can return anything the predict method can.
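
For example, a custom health check could report diagnostic information alongside the default status; a minimal sketch, assuming a JSON-serializable dict is an acceptable return value:

from spell.serving import BasePredictor
from tensorflow import keras


class Predictor(BasePredictor):
    def __init__(self):
        self.model = keras.models.load_model("/model/model.h5")

    def predict(self, payload):
        ...

    def health(self):
        # Include whether the model attribute has been set alongside the usual status
        return {"status": "ok", "model_loaded": getattr(self, "model", None) is not None}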

Handling non-JSON requests

Behind the scenes, Spell serving uses Starlette, and the entire Starlette Request object can be received by both the predict and the health methods either using type annotations or decorators. Using annotations, you can receive a full Request using the following syntax:

from starlette.requests import Request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload: Request):
        ...

    def health(self, request: Request):
        ...

Using this syntax, any parameter given the Request type annotation will receive the full Request.
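
For example, the full Request exposes the request's headers and query parameters, which the predictor can inspect before deciding how to interpret the body; a minimal sketch, where the top_k query parameter is purely illustrative:

from starlette.requests import Request

from spell.serving import BasePredictor


class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload: Request):
        # Request metadata is available directly on the Request object
        content_type = payload.headers.get("content-type", "")
        top_k = int(payload.query_params.get("top_k", 1))
        ...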

Alternatively, you can use decorator syntax.

from spell.serving import with_full_request

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_full_request(name="payload")
    def predict(self, payload):
        ...

    @with_full_request()
    def health(self, request):
        ...

By default, this decorator will pass the Request into a parameter named "request", but this can be overridden by providing a "name" argument to the decorator.

(Advanced) Launching background tasks

Additional post-processing can be launched asynchronously after the response has been returned using Starlette Background Tasks. This can be done using either annotation or decorator syntax.

Using annotations, you can spawn background tasks using the following syntax:

from starlette.background import BackgroundTasks

async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload, tasks: BackgroundTasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)

    def health(self, background: BackgroundTasks):
        ...
        background.add_task(some_task, "other_foo")

Using this syntax, any parameter given the BackgroundTasks type annotation will receive the BackgroundTasks object.

Alternatively, you can use decorator syntax.

from spell.serving import with_background_tasks


async def some_task(foo, bar=1):
    ...

class Predictor(BasePredictor):
    def __init__(self):
        ...

    @with_background_tasks(name="bg_tasks")
    def predict(self, payload, bg_tasks):
        ...
        bg_tasks.add_task(some_task, "my_foo", bar=3)

    @with_background_tasks()
    def health(self, tasks):
        ...
        tasks.add_task(some_task, "my_foo", bar=3)

By default, this decorator will pass the BackgroundTasks into a parameter named "tasks", but this can be overridden by providing a "name" argument to the decorator.

Accessing model metadata

Some metadata about the current model can be found in the self.model_info attribute of the Predictor. This is a namedtuple containing fields for the name and version of the model being served.
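
For example, the served model's identity could be echoed back with each prediction to make it easier to tell which model version handled a request; a minimal sketch, assuming the namedtuple fields are named name and version as described above and that a JSON-serializable dict is a valid return value:

from spell.serving import BasePredictor


class Predictor(BasePredictor):
    def __init__(self):
        ...

    def predict(self, payload):
        prediction = ...  # run inference as usual
        return {
            "prediction": prediction,
            # self.model_info identifies the model version this server is running
            "model": self.model_info.name,
            "version": self.model_info.version,
        }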

Adding mounts

Some model servers may depend on extra file artifacts beyond just those included with the model itself. Any artifact registered to SpellFS can be passed to model servers with the --mount flag.

Note that unlike runs, mounts in model servers are restricted to the /mounts directory, and any specified destination is interpreted as being relative to /mounts. For example, in the following configuration, the artifact uploads/config would be available within the server as /mounts/config, and the artifact uploads/text would be available as /mounts/vocab.

$ spell server serve \
    --mount uploads/config \
    --mount uploads/text:vocab \
    cifar10:example predictor.py
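
Inside the predictor, mounted artifacts can then be read like any other files; a minimal sketch for the mounts above, assuming uploads/config is a JSON file and uploads/text is a plain-text vocabulary file (both assumptions for illustration):

import json

from spell.serving import BasePredictor


class Predictor(BasePredictor):
    def __init__(self):
        # uploads/config is mounted at /mounts/config, uploads/text at /mounts/vocab
        with open("/mounts/config") as f:
            self.config = json.load(f)
        with open("/mounts/vocab") as f:
            self.vocab = f.read().splitlines()

    def predict(self, payload):
        ...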

Adding Predictor configuration YAML

Configuration information can be provided to the Predictor as either a JSON or YAML file using the --config flag. For example, if you had a configuration named predict-config.yaml such as

use_feature_x: true
additional_info:
  output_type: bytes

Then you could modify the __init__ of the Predictor to accept this configuration by adding additional arguments

class Predictor(BasePredictor):
    def __init__(self, use_feature_x, additional_info=None):
        self.use_feature_x = use_feature_x # True
        self.additional_info = additional_info or {}
        # {"output_type": "bytes"}
        ...

The serve command would now be

$ spell server serve \
    --config /path/to/predict-config.yaml \
    cifar:v1 predictor.py

Installing dependencies

Additional dependencies can be provided to the model server environment using similar flags as runs. Model servers support adding Pip dependencies via --pip and --pip-req, Conda environments via --conda-file, and APT dependencies via --apt. Custom Docker images are not yet supported.

(Advanced) Managing autoscaling configuration

The serve command also comes with flags that allow you to tune the autoscaling and scheduling configuration of the model serving deployment. Spell ships with sane defaults for these parameters, so they are not necessary for putting up a basic model server.

  • --min-pods/--max-pods: Set the minimum and maximum number of pods that autoscaling should schedule.
  • --target-cpu-utilization: Set the average CPU usage at which to signal the autoscaler to schedule a new pod.
  • --target-requests-per-second: Set the average HTTP(S) requests per second, per pod, at which to signal the autoscaler to schedule a new pod. Can be used in combination with --target-cpu-utilization.
  • --cpu-request/--ram-request: Set the CPU/memory request values of the pods to adjust how they are scheduled on the cluster.
  • --cpu-limit/--ram-limit: Set CPU/memory limits on the pods to limit the amount of resources they consume on the node.
  • --gpu-limit: Set the maximum number of GPUs available to each pod. Fractional GPU limits are not supported.

Monitoring model servers

You can list all model servers with the spell server command. This will list the name, endpoint URL, and pod status of the model servers. You can also explore the server list page in the web console by selecting 'Servers' in the left hand navigation sidebar.

$ spell server
NAME   URL                            PODS (READY/TOTAL)    AGE
cifar  http://.../cifar/predict       1/1                   1 day

For more detailed information about a model server, use the spell server describe command or navigate to the details page.

To retrieve the logs of a running model server, use the spell server logs command or scroll to the bottom of the details page. Logs are broken up by pod.

You can see hardware metrics (CPU usage and memory per pod) as well as request metrics (request rate, request failures, and latency) in the web console.

(Advanced) Accessing the Kubernetes cluster directly

Spell provides a convenience utility spell cluster kubectl to query the EKS/GKE Kubernetes cluster underlying a model serving deployment. This is only intended for advanced users who are familiar with Kubernetes.

Note

Using kubectl recklessly has the potential to break your model serving deployment. Avoid using commands that alter Kubernetes state. kubectl get and kubectl describe are safe operations.

Updating a model server

All components of a model server except its name can be updated using the spell server update command. This will result in a zero-downtime rollout of the updated model server.

Stopping and starting model servers

Stopping a model server unschedules it from the cluster without deleting it altogether. This can be useful to save on node resources while still keeping the model server itself around to start up again later. Servers can be stopped with the spell server stop command and started with the spell server start command as well as from the web console.

Deleting a model server

Model servers can be deleted with the spell server rm command as well as from the web console. If a model server has not been stopped, the -f flag can be used to stop and delete a model server.

Node groups

Node groups allow model servers to be scheduled to specific kinds of machines, akin to machine types for runs. Node groups are managed with the spell cluster node-group command group. Note that GPU support and GPU-based autoscaling are still in active development at the time of writing.

Create a node group

Node groups can be created with the spell cluster node-group add command. This will create an EKS node group in AWS and a GKE node pool in GCP. In the simplest configuration, the command takes a name for the node group, and an instance type.

$ spell cluster node-group add \
    --name t4 \
    --instance-type g4dn.xlarge

Node groups can also be deployed on spot instances with the --spot flag.

$ spell cluster node-group add \
    --name t4-spot \
    --instance-type g4dn.xlarge \
    --spot

For advanced node group configuration, it is also possible to pass an eksctl ClusterConfig that defines a node group. See the eksctl docs for more details.

$ spell cluster node-group add \
    --name t4 \
    --config-file custom_nodegroup.yaml

Create a GPU-enabled node group

To save costs, GPU model serving is not enabled by default. To enable GPU support for model servers, create a GPU-enabled node group.

If you are using Amazon EKS, use the --instance-type flag:

$ spell cluster node-group add --name gpu --instance-type g4dn.xlarge

If you are using Google GKE, you must specify the desired type of GPU using the --accelerator flag in addition to the --instance-type flag.

For example:

$ spell cluster node-group add --name gpu --instance-type n1-standard-1 --accelerator nvidia-tesla-t4

Make sure that your GCP account has quota for GPUs. Additional quota can be requested manually.

Not all GPU types are available in all GCP regions. Use gcloud compute accelerator-types list to view GPUs available for your region.

List node groups

Node groups can be listed with the spell cluster node-group command. This will list the names of the node groups as well as instance specifications and autoscaling configurations.

$ spell cluster node-group
NAME     INSTANCE TYPE    DISK SIZE    MIN NODES    MAX NODES
default  m5.large         50           1            2
t4       g4dn.xlarge      40           0            0

You can also explore the node groups and their current utilization in the web console under a tab on the clusters page.

Image of the Node Group tab.

Scale a node group

Node groups can be scaled to adjust the minimum and maximum number of nodes that the node autoscaler should maintain. This is done via the spell cluster node-group scale command.

Delete a node group

Node groups can be deleted through the spell cluster node-group delete command.

Assign a model server to a node group

Model servers can be assigned to a specific node group at creation time via the --serving-group flag. If omitted, the server will be assigned to the default node group of the cluster.

$ spell server serve \
    --serving-group t4 \
    cifar:v1 predictor.py