# GPU Worker Nodes

DigitalOcean Kubernetes (DOKS) is a Kubernetes service with a fully managed control plane, high availability, and autoscaling. DOKS integrates with standard Kubernetes toolchains and DigitalOcean’s load balancers, volumes, CPU and GPU Droplets, API, and CLI.

GPU worker nodes are built on [GPU Droplets](https://docs.digitalocean.com/products/droplets/details/features/index.html.md#gpu-droplets), which are powered by AMD and NVIDIA GPUs. Using GPU worker nodes in your cluster, you can:

- Experiment with and develop AI/ML applications in containerized environments
- Run distributed AI workloads on Kubernetes
- Scale AI inference services

## Available GPU Node Pools

We offer the following GPU options for creating node pools:

| AMD GPU | Slug |
|---|---|
| Instinct MI300X | `gpu-mi300x1-192gb` |
| Instinct MI300X (8x) | `gpu-mi300x8-1536gb` |
| Instinct MI300X (8x) for multi-node setup | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI325X | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI325X (8x) | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI325X (8x) for multi-node setup | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI350X | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI350X (8x) | [By contract](https://www.digitalocean.com/company/contact/sales) |
| Instinct MI350X (8x) for multi-node setup | [By contract](https://www.digitalocean.com/company/contact/sales) |

| NVIDIA GPU | Slug |
|---|---|
| H100 | `gpu-h100x1-80gb` |
| H100 (8x) | `gpu-h100x8-640gb` |
| H100 (8x) for multi-node setup | [By contract](https://www.digitalocean.com/company/contact/sales) |
| H200 (8x) for multi-node setup | [By contract](https://www.digitalocean.com/company/contact/sales) |
| L40s | `gpu-l40sx1-48gb` |
| RTX 4000 | `gpu-4000adax1-20gb` |
| RTX 6000 | `gpu-6000adax1-48gb` |

**Note**: For multiple 8-GPU
H100 worker nodes, we support high-speed networking between GPUs on different nodes, using 8x Mellanox 400GbE interfaces. To enable this, submit the [H100 multi-node setup form](https://anchor.digitalocean.com/multi-node-h100s-request.html).

You can use the 8-GPU configurations in a multi-node setup, where the GPUs are connected via a dedicated high-speed networking fabric. To learn how to configure high-speed networking for multi-node GPUs, see [How to Use Multi-Node GPUs](https://docs.digitalocean.com/products/kubernetes/how-to/configure-multinode-gpus/index.html.md).

## Runtime, Drivers, and Plugins

To run GPU workloads, you do not need to specify a [Runtime Class](https://kubernetes.io/docs/concepts/containers/runtime-class/). DigitalOcean also installs and manages the drivers required to enable GPU worker nodes, as described below. For the latest installed versions, see the [DOKS changelog](https://docs.digitalocean.com/products/kubernetes/details/changelog/index.html.md#available-versions).

### AMD GPUs

For AMD GPUs, we install the following drivers:

- [AMDGPU driver](https://www.amd.com/en/support/download/linux-drivers.html)
- [AMD ROCm](https://www.amd.com/en/products/software/rocm.html)

We also recommend the following additional software:

- [ROCm Device Plugin for Kubernetes](https://github.com/ROCm/k8s-device-plugin) for GPU discovery, health checks, configuration of GPU-enabled containers, and time slicing. We automatically deploy this component when you create or update a cluster. You can turn this option off by setting `amd_gpu_device_plugin` to `false` in the request body when [creating](https://docs.digitalocean.com/reference/api/reference/kubernetes/index.html.md#kubernetes_create_cluster) or [updating](https://docs.digitalocean.com/reference/api/reference/kubernetes/index.html.md#kubernetes_update_cluster) a cluster using the API.
- [AMD Device Metrics Exporter](https://github.com/ROCm/device-metrics-exporter) for ingesting GPU metrics into your monitoring system. You can install this plugin by setting `amd_gpu_device_metrics_exporter_plugin` to `true` in the request body when [creating](https://docs.digitalocean.com/reference/api/reference/kubernetes/index.html.md#kubernetes_create_cluster) or [updating](https://docs.digitalocean.com/reference/api/reference/kubernetes/index.html.md#kubernetes_update_cluster) a cluster using the API. The plugin is installed in the `kube-system` namespace of the Kubernetes cluster.

### NVIDIA GPUs

For NVIDIA GPUs, we install the following drivers:

- [NVIDIA CUDA drivers](https://www.nvidia.com/en-us/drivers/)
- [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit)
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit)

We also recommend the following additional software:

- [NVIDIA device plugin for Kubernetes](https://github.com/NVIDIA/k8s-device-plugin) for GPU discovery, health checks, configuration of GPU-enabled containers, and time slicing. You can configure the plugin and deploy it using `helm` as described in the `README` file of the GitHub repository.
- [NVIDIA DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html) for monitoring your cluster using [Prometheus](https://docs.digitalocean.com/products/marketplace/catalog/kubernetes-monitoring-stack/index.html.md).

## Additional Features

DigitalOcean applies additional labels and taints to GPU worker nodes. For more information, see [Automatic Application of Labels and Taints to Nodes](https://docs.digitalocean.com/products/kubernetes/details/managed/index.html.md#automatic-application-of-labels-and-taints-to-nodes).
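As an illustration, a Pod that runs on a GPU worker node typically needs two things: a resource limit on the extended resource that the device plugin exposes, and a toleration for the GPU taint. The following sketch assumes the conventional `nvidia.com/gpu` taint key and resource name used by the NVIDIA device plugin; check the labels-and-taints page linked above for the exact taint DOKS applies to your node pool.

```yaml
# Example Pod requesting one NVIDIA GPU.
# The taint key below follows the common nvidia.com/gpu convention
# (an assumption for illustration); verify it against the taints
# DOKS applies to your GPU node pool.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]          # prints GPU info if the device is visible
      resources:
        limits:
          nvidia.com/gpu: 1            # extended resource exposed by the device plugin
```

For AMD node pools, the ROCm device plugin exposes an analogous extended resource (`amd.com/gpu`) that you would request in the same way.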
You can also use the [cluster autoscaler](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#basics) to automatically scale a GPU node pool down to zero, or use the DigitalOcean CLI or API to manually scale the node pool down to 0. Autoscaling is useful for on-demand usage and for jobs like training and fine-tuning, where GPU nodes are only needed while work is running.
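As a sketch, a node-pool object in a DigitalOcean API cluster create or update request can combine a GPU slug from the tables above with autoscaling bounds. The field names follow the public Kubernetes node-pool API; the pool name and bounds are illustrative:

```json
{
  "name": "gpu-workers",
  "size": "gpu-h100x1-80gb",
  "count": 1,
  "auto_scale": true,
  "min_nodes": 0,
  "max_nodes": 2
}
```

With `min_nodes` set to 0, the autoscaler can remove all GPU nodes from the pool when no pending workloads request GPUs, so you only pay for GPU nodes while jobs are running.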