DigitalOcean Kubernetes (DOKS) is a managed Kubernetes service. Deploy Kubernetes clusters with a fully managed control plane, high availability, autoscaling, and native integration with DigitalOcean Load Balancers and volumes. DOKS clusters are compatible with standard Kubernetes toolchains and the DigitalOcean API and CLI.
GPU worker nodes are now in early availability for select DOKS users. You can either create a new cluster with GPU worker nodes or add a GPU node pool to an existing cluster running version 1.30.4-do.0, 1.29.8-do.0, 1.28.13-do.0, or later.
GPU worker nodes are built on GPU Droplets, which are powered by NVIDIA’s H100 GPUs.
With GPU worker nodes in your cluster, you can run GPU workloads, such as training and inference, alongside your other Kubernetes workloads. You do not need to specify a RuntimeClass to run GPU workloads, which makes them easier to set up.
GPU Droplets have NVIDIA H100 GPUs in either a single-GPU or 8-GPU configuration, and come with two different kinds of storage:

- Boot disk: A local, persistent disk on the Droplet that stores data for software like the operating system and ML frameworks.
- Scratch disk: A local, non-persistent disk that stores data for staging purposes, like inference and training.
The following table summarizes additional specifications:
| GPU Vendor | NVIDIA H100 | NVIDIA H100x8 |
|---|---|---|
| GPUs per Droplet | 1 | 8 |
| GPU Memory | 80 GB | 640 GB |
| Droplet Memory | 240 GB | 1920 GB |
| Droplet vCPUs | 20 | 160 |
| Local Storage: Boot Disk | 720 GiB NVMe | 2 TiB NVMe |
| Local Storage: Scratch Disk | 5 TiB NVMe | 40 TiB NVMe |
| Network Bandwidth (Maximum Speeds) | 10 Gbps public, 25 Gbps private | 10 Gbps public, 25 Gbps private |
| Slug | gpu-h100x1-80gb | gpu-h100x8-640gb |
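For example, you can check which node size slugs are available to your account with doctl; the GPU slugs above should appear in the output if GPU worker nodes are available to you:

```bash
# List the node sizes (slugs) available for DOKS node pools
doctl kubernetes options sizes
```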
GPU worker nodes are billed per second at the same rate as GPU Droplets. See DOKS node pool pricing for more details.
For reservation and contract pricing, contact your sales representative or Customer Success Manager, or send a request using the H100 GPU Worker Nodes form.
GPU worker nodes for DOKS are currently available in DigitalOcean’s TOR1 datacenter. We plan to support additional datacenters in the near future.
You cannot currently scale the GPU node pools down to zero.
You need to monitor GPU usage and manage scaling manually.
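Because of this, manually resizing the pool with doctl is the typical way to adjust capacity. For example, using the hypothetical cluster and pool names from the examples later on this page:

```bash
# Manually resize an existing GPU node pool to one node
doctl kubernetes cluster node-pool update gpu-cluster gpu-worker-pool --count 1
```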
We do not currently support creating or adding GPU worker nodes using the DigitalOcean Control Panel.
You can use the DigitalOcean CLI or API to create a new cluster or add a GPU node pool to an existing cluster.
To create a cluster with GPU worker nodes, run doctl kubernetes cluster create, specifying the GPU machine type as the node pool size. The following example creates a cluster with a node pool of three worker nodes, each in the single-GPU configuration with 80 GB of GPU memory:
```bash
doctl kubernetes cluster create gpu-cluster --region tor1 --version 1.30.4-do.0 --node-pool "name=gpu-worker-pool;size=gpu-h100x1-80gb;count=3"
```
To add GPU worker nodes to an existing cluster, run doctl kubernetes cluster node-pool create, specifying the GPU machine type as the size. The following example adds a node pool of four GPU worker nodes, each in the single-GPU configuration with 80 GB of GPU memory, to a cluster named gpu-cluster:
```bash
doctl kubernetes cluster node-pool create gpu-cluster --name gpu-worker-pool-1 --size gpu-h100x1-80gb --count 4
```
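To verify that the GPU nodes joined the cluster, you can download the cluster's kubeconfig and filter nodes by the GPU label that DigitalOcean applies (described later on this page). The cluster name here matches the examples above:

```bash
# Fetch the kubeconfig for the cluster and point kubectl at it
doctl kubernetes cluster kubeconfig save gpu-cluster

# List only the GPU worker nodes using the label DigitalOcean applies
kubectl get nodes -l doks.digitalocean.com/gpu-brand=nvidia
```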
To create a cluster with GPU worker nodes, send a POST request to https://api.digitalocean.com/v2/kubernetes/clusters with the following request body:
```bash
curl --location 'https://api.digitalocean.com/v2/kubernetes/clusters' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  --data '{
    "name": "gpu-cluster",
    "region": "tor1",
    "version": "1.30.4-do.0",
    "node_pools": [
      {
        "size": "gpu-h100x1-80gb",
        "count": 3,
        "name": "gpu-worker-pool"
      }
    ]
  }'
```
This creates a cluster with a node pool of three GPU worker nodes, each in the single-GPU configuration with 80 GB of GPU memory.
To add GPU worker nodes to an existing cluster, send a POST request to https://api.digitalocean.com/v2/kubernetes/clusters/{cluster_id}/node_pools with the following request body:
```bash
curl --location --request POST 'https://api.digitalocean.com/v2/kubernetes/clusters/{cluster_id}/node_pools' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  --data '{
    "size": "gpu-h100x1-80gb",
    "count": 4,
    "name": "new-gpu-worker-pool"
  }'
```
This adds a node pool of four GPU worker nodes, each in the single-GPU configuration with 80 GB of GPU memory, to the existing cluster specified by its cluster ID, cluster_id.
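To confirm that the pool was created, you can list the cluster's node pools with a GET request to the same endpoint, for example:

```bash
curl --location 'https://api.digitalocean.com/v2/kubernetes/clusters/{cluster_id}/node_pools' \
  --header "Authorization: Bearer $DIGITALOCEAN_TOKEN"
```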
DigitalOcean applies the following labels and taint to the GPU worker nodes:
| Label | Taint |
|---|---|
| doks.digitalocean.com/gpu-brand: nvidia <br> doks.digitalocean.com/gpu-model: h100 | nvidia.com/gpu:NoSchedule |
You can use these labels and the taint in your workload specification to schedule pods onto GPU worker nodes, as shown in the following example.
Once you have a GPU worker node in your cluster, you can run GPU workloads. You can base the configuration spec for your actual workloads on the following sample pod spec, which runs NVIDIA's CUDA vector-add sample image and uses the labels and taint applied to GPU worker nodes:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  restartPolicy: Never
  nodeSelector:
    doks.digitalocean.com/gpu-brand: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```
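To try the sample, you can save the spec to a file (gpu-workload.yaml is an assumed filename), apply it, and read the pod's logs after it completes; the vector-add sample typically reports whether its test passed:

```bash
# Apply the sample pod spec (filename is an assumed example)
kubectl apply -f gpu-workload.yaml

# After the pod runs to completion, inspect its output
kubectl logs gpu-workload
```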
DigitalOcean installs and manages the drivers required to enable GPU support on the GPU worker nodes. However, for GPU discovery, health checks, configuration of GPU-enabled containers, and time slicing, you need an additional component, the NVIDIA device plugin for Kubernetes. You can configure the plugin and deploy it using helm, as described in the README file of its GitHub repository.
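As a minimal sketch of the Helm-based installation (the repository URL, chart name, and namespace below are taken from the device plugin's public chart and may change, so check the README for current values):

```bash
# Add the NVIDIA device plugin Helm repository (URL assumed; verify against the README)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the device plugin into its own namespace
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace
```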
For monitoring your cluster using Prometheus, you need to install NVIDIA DCGM Exporter.
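The DCGM Exporter can also be installed with Helm; the repository URL and chart name below are assumptions based on the project's documentation, so confirm them before use:

```bash
# Add the DCGM Exporter Helm repository (URL assumed; see the project's docs)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

# Install the exporter so Prometheus can scrape GPU metrics from the nodes
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```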
For support or troubleshooting, open a support ticket.
For feedback or questions about the GPU worker nodes offering, contact your account representative.