Cluster Machine Type Management

For users on the Spell for Teams plan, we deploy Spell in your cloud and provide the same cluster management tools backing our own internal infrastructure. This means you can keep your data in your own S3 buckets, perform runs on your own machines, and deploy models within your own cloud infrastructure. If you are looking to instead use personal private machines in your own datacenter, office, or home, you can find the instructions in the guide on private machines.

Once you have set up an AWS or GCP cluster, you can define distinct cloud machine types to execute your runs on. This guide walks you through creating and managing machine types in your cluster.

Creating a new machine type

Machine Type Creation Dialog

Under Clusters, click on your cluster, then click "Add New Machine Type". This will bring up a menu of configuration options:

  • Name. This is the name you will pass to the --machine_type parameter when you create runs (see the sketch after this list).
  • Instance Type. Select whether you want a machine with a Graphical Processing Unit (GPU) or without one (CPU). Then use the dropdown menu to choose from our list of preselected instance types. For a list of the instance types available see the section "Available instance types".
  • On Demand/Spot. Spot instances are significantly cheaper than on demand instances, but come with the caveat that they may be reclaimed by the cloud provider (AWS/GCP) at any time. For recommendations on when to use which type, see the section "Using preemptible instances".
  • Storage Size. This is the size of the disk attached to the machine at startup. Keep in mind that some of this space will be reserved by Spell-related images; precisely how much depends on which and how many framework images you assign to this machine type. Note that Spell will automatically resize your disk as you fill it up: see the section "Automatic resizing" for details.
  • Additional Images. All instances come with the default framework (TensorFlow 1, PyTorch, and Conda) image pre-installed. Use these checkboxes to attach additional framework images. Consult the section "Available frameworks" for details.
  • Machine Limits. Specify the minimum and maximum number of machines that can be running at one time, and an idle timeout. For guidance on what to set these values to, see the next section of this guide, "Guidance on machine limits".
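
Once a machine type exists, its name is what you reference when launching work on it. Below is a minimal sketch of doing so from Python by shelling out to the Spell CLI; it assumes the CLI is installed and authenticated, the machine type name "t4-spot" and the script "train.py" are hypothetical placeholders, and the flag spelling follows the parameter named above.

```python
# Minimal sketch: launch a run on a machine type defined in this dialog.
# Assumes the Spell CLI is installed and authenticated; "t4-spot" and
# "train.py" are placeholders.
import subprocess

subprocess.run(
    ["spell", "run", "--machine_type", "t4-spot", "python train.py"],
    check=True,
)
```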

Guidance on machine limits

TLDR: we recommend setting Minimum Machines to 0 (the default), Maximum Machines to the size of your team, and Idle Timeout to 30 minutes (the default).

Spell will attempt to maintain Minimum Machines ready at all times, and will spin up additional machines as needed to accommodate bursts in usage.

Spinning up a machine takes time, so keeping some number of machines idle can help speed up development (connecting to an idle machine is nearly instantaneous). On the other hand, machines kept idle but unused cost money without providing any benefit. For most use cases we recommend setting Minimum Machines to 0; this ensures that you are only paying for compute that you actually use.

The total number of machines of a given machine type will never exceed Maximum Machines. As a rule of thumb, we recommend matching this value to the size of your team. For example, if you have 8 data scientists on staff, set Maximum Machines to 8.

AWS and GCP enforce resource limits on instance types. Some of these limits are very aggressive; for example, by default AWS only allows one concurrent instance of most GPU types. Before finishing cluster setup, double-check your service limits and quotas with your cloud provider. If a limit is too low, you may put in a request for a service limit increase or resource quota increase; these are usually approved quickly, within 24 hours. Avoid setting Maximum Machines higher than your service limit for that instance type.

The last machine limit value is the Idle Timeout. This value controls how long Spell waits before spinning down a machine that has finished its run. If another run is scheduled during the idle period, it starts almost immediately, since it does not have to wait for a new cloud machine to become available and spin up. We recommend keeping Idle Timeout at the default setting of 30 minutes for most instance types. Consider setting Idle Timeout to 0 minutes for especially powerful or expensive instance types that you expect to use rarely.

Modifying an existing machine type

Machine Type Edit Dialog

Once you have a machine type set up, you can modify its Minimum and Maximum Machines, as well as its Idle Timeout. Use this when you are running many runs in parallel and need more machines, or when you want to terminate all existing machines because you know you will not be using them for some amount of time.

Using preemptible instances

When you create a new machine type, you can select Preemptible/Spot instead of On Demand. These machines are significantly cheaper than their On Demand counterparts; however, AWS/GCP can terminate them at any time.

Preemptible instances are ideal for machine learning training jobs that are fault tolerant. Any files you save to disk over the course of the run are automatically backed up to SpellFS, even if the run is interrupted. If your training code is configured to use checkpoints, it is then easy to mount these checkpoints into a new run and finish training from there. A code sample demonstrating this pattern in action is available in the spell examples repository.
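
As a concrete illustration of that checkpointing pattern (a minimal sketch, not the code sample from the examples repository), the training loop below writes a checkpoint to disk at the end of every epoch, so an interrupted spot run leaves resumable state behind in SpellFS. The model, optimizer, and data loader are placeholders.

```python
# Sketch of checkpoint-friendly training for a preemptible instance.
# Everything written under ./checkpoints is backed up to SpellFS, so an
# interrupted run's progress is not lost. Model/optimizer/data are placeholders.
import os
import torch
import torch.nn.functional as F

def train(model, optimizer, data_loader, epochs, ckpt_dir="checkpoints"):
    os.makedirs(ckpt_dir, exist_ok=True)
    for epoch in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()
        # Persist progress to disk at the end of every epoch.
        torch.save(
            {
                "epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            os.path.join(ckpt_dir, f"epoch_{epoch}.pt"),
        )
```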

If your script is written such that it idempotently resumes wherever it left off given a prior run's disk state, then you can go one step further using Spell's "Auto Resume" feature. This can be configured using the --auto-resume/--disable-auto-resume flag on the spell run command (see the documentation for the run command). When a run is interrupted by the cloud provider and auto resume is enabled, Spell will create a new run with identical parameters, restore the interrupted run's saved disk state, and queue it up. When a machine becomes available again (usually when demand drops), the resumed run will execute, continuing the computation of the interrupted run.
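
To make that concrete, here is a minimal sketch of idempotent resume logic, assuming checkpoints are laid out as in the previous sketch: on startup the script loads the latest checkpoint restored from the interrupted run, if any, and continues from the following epoch. Because the same command produces the same behavior whether or not prior state exists, re-running it via auto resume is safe.

```python
# Sketch of idempotent resume: continue from the latest checkpoint restored
# from the interrupted run's disk state, or start fresh if none exists.
import glob
import os
import torch

def load_latest_checkpoint(model, optimizer, ckpt_dir="checkpoints"):
    """Restore state from the newest checkpoint; return the epoch to resume at."""
    checkpoints = glob.glob(os.path.join(ckpt_dir, "epoch_*.pt"))
    if not checkpoints:
        return 0  # no prior state: start from the beginning
    latest = max(
        checkpoints,
        key=lambda path: int(os.path.basename(path)[len("epoch_"):-len(".pt")]),
    )
    state = torch.load(latest, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1  # resume with the epoch after the saved one
```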

You can also give any preemptible machine type a default auto resume behavior when creating or editing it.

Automatic resizing

Spell will automatically resize a run's disk when it is nearing capacity, without interrupting the run. Using more than 80% of the disk triggers a resize that doubles the disk's capacity. A larger disk increases cost, but Spell recycles machines that are no longer in use to ensure you are using machines efficiently and keeping costs down.

Effectively, this means that runs will always start with the disk size you specified during machine type creation and resize to larger disks as needed.

AWS disks resize slowly and can only be resized once every six hours; these are limitations of AWS.

GCP disks resize quickly and can be resized multiple times in succession; if you need more space, the disk will keep growing as needed.

Deleting a machine type

Once you know that you no longer need a machine type, you can delete it. This will do the following:

  1. All runs queued on that machine type will be preemptively killed.
  2. All machines will be terminated.
  3. The machine type will be removed from your cluster dashboard, and you will not be able to schedule future runs on this machine type. However, you can always create a new machine type, even with the same name.

Available instance types

CPUs

| Instance Type | AWS         | GCP            |
|---------------|-------------|----------------|
| cpu           | c5.large    | n1-standard-2  |
| cpu-big       | c5.4xlarge  | n1-standard-16 |
| cpu-huge      | c5.18xlarge | n1-standard-96 |
| ram-big       | r5.4xlarge  |                |
| ram-huge      | r5.24xlarge |                |

GPUs

| Instance Type | AWS           | GCP Instance  | GCP Accelerator       |
|---------------|---------------|---------------|-----------------------|
| K80           | p2.xlarge     | n1-highmem-4  | nvidia-tesla-k80 x 1  |
| K80x2         |               | n1-highmem-16 | nvidia-tesla-k80 x 2  |
| K80x4         |               | n1-highmem-32 | nvidia-tesla-k80 x 4  |
| K80x8         | p2.8xlarge    | n1-highmem-64 | nvidia-tesla-k80 x 8  |
| V100          | p3.2xlarge    | n1-highmem-4  | nvidia-tesla-v100 x 1 |
| V100x4        | p3.8xlarge    | n1-highmem-32 | nvidia-tesla-v100 x 4 |
| V100x8        | p3.16xlarge   | n1-highmem-64 | nvidia-tesla-v100 x 8 |
| V100x8-big    | p3dn.24xlarge |               |                       |
| P100          |               | n1-highmem-4  | nvidia-tesla-p100 x 1 |
| P100x2        |               | n1-highmem-16 | nvidia-tesla-p100 x 2 |
| P100x4        |               | n1-highmem-32 | nvidia-tesla-p100 x 4 |
| T4            | g4dn.xlarge   | n1-highmem-4  | nvidia-tesla-t4 x 1   |
| T4-big        | g4dn.4xlarge  |               |                       |
| T4-huge       | g4dn.16xlarge |               |                       |
| T4x2          |               | n1-highmem-16 | nvidia-tesla-t4 x 2   |
| T4x4          | g4dn.12xlarge | n1-highmem-32 | nvidia-tesla-t4 x 4   |

To request support for other instance types, please contact support@spell.ml.

Available frameworks

All machines come with the default framework installed, which supports PyTorch (torch==1.5.0, torchvision==0.6.0, and pytorch-lightning==0.8.4), TensorFlow 1 (tensorflow==1.14.0), and Conda.
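
To confirm exactly which versions are present inside a run on a given machine type, a quick check like the following works (a sketch; the versions printed are whatever the selected image actually ships).

```python
# Print the versions of the frameworks preinstalled in the image.
import tensorflow as tf
import torch
import torchvision

print("tensorflow:", tf.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
```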

You may also include additional frameworks by toggling the checkboxes on the machine definition in the cluster management pane. Note that including additional frameworks increases the footprint on disk of Spell-related resources as well as the amount of time the machine takes to start up, so it's a good idea to be picky about which frameworks you provide on a machine type.

| Framework    | Notes                                                | Approximate disk size | CUDA toolkit |
|--------------|------------------------------------------------------|-----------------------|--------------|
| Default      | tensorflow==1.14.0, torch==1.4.0, torchvision==0.5.0 |                       | 10.0         |
| TensorFlow 2 | tensorflow==2.0.0                                    | 2GB                   | 10.1         |
| FastAI       | fastai==1.0.52                                       | 4GB                   | 10.1         |