How to Configure Networking for Multi-Node GPU Worker Nodes

Validated on 7 Nov 2025 • Last edited on 5 Dec 2025

DigitalOcean Kubernetes (DOKS) is a Kubernetes service with a fully managed control plane, high availability, and autoscaling. DOKS integrates with standard Kubernetes toolchains and DigitalOcean’s load balancers, volumes, CPU and GPU Droplets, API, and CLI.

Multi-node GPU clusters can only be created in multiples of 8 GPUs and are available by contract only. For more information on supported GPUs, see GPU Worker Nodes.

In a multi-node configuration, the 8-GPU worker nodes are connected via a dedicated high-speed networking fabric in the DOKS cluster. The networking fabric is exposed on each worker node through eight network interface controllers (NICs) named fabric0, fabric1, …, fabric7, which exist alongside the regular eth0 and eth1 interfaces. The eth0 interface provides public internet connectivity, and eth1 provides private connectivity to other nodes in the same VPC network. The fabric NICs let AI/ML workloads exchange data with very low latency and high throughput. For the best networking performance, we recommend using the Remote Direct Memory Access (RDMA) protocol for communication between the GPU nodes over the fabric NICs, because RDMA bypasses the CPU and the operating system kernel entirely during data transfer.
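
To see which interfaces are present on a fabric-connected worker node, you can start a temporary debug pod on the node and list its network interfaces. This is a minimal sketch that assumes node debugging with kubectl debug is allowed in your cluster; replace <node-name> with the name of a GPU worker node:

kubectl debug node/<node-name> -it --image=busybox -- ip link show

Because the debug pod shares the node's network namespace, the output lists eth0, eth1, and fabric0 through fabric7.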

Additional plugins are required to enable the high-speed fabric for multi-node GPU networking. This guide covers the additional required components and how to configure them.

Required Plugins

To use the high-speed fabric with container-based workloads, the following Kubernetes plugins must be available on clusters with AMD or NVIDIA GPUs:

  • Mellanox k8s-rdma-shared-dev-plugin: This plugin is automatically installed in your DOKS cluster when you add a node pool with a fabric-connected slug. It exposes RDMA-related resources as Kubernetes resources, named rdma/fabric0, rdma/fabric1, rdma/fabric2, …, rdma/fabric7. You can manage these resources using resource requests and limits in your manifests.

  • Multus CNI plugin: You must install this plugin manually. It moves the fabric0, fabric1, ..., fabric7 NICs into the container namespace via the host-device plugin. To install the plugin, run the following command:

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
Note
Regular public and private communication in the cluster via eth0 and eth1 is not affected by the installation of the Multus CNI and continues to use Cilium.
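
To confirm that Multus is running after you apply the manifest, you can check its pods. This is a sketch that assumes the default manifest above, which deploys Multus into the kube-system namespace; the exact pod names and labels can vary between Multus releases:

kubectl get pods --namespace=kube-system | grep multus

Each worker node should show one Multus pod in the Running state.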

After installing the CNI plugin, create NetworkAttachmentDefinition resources for the fabric NICs as described in the Configure Multus CNI Plugin section below.

Expose the RDMA-related resources managed by the Mellanox k8s-rdma-shared-dev-plugin to your workloads. To do this for AMD GPU nodes, add the following resource requests and limits to your Pod or Deployment manifest:

resources:
  requests:
    amd.com/gpu: 8
    rdma/fabric0: 1
    rdma/fabric1: 1
    rdma/fabric2: 1
    rdma/fabric3: 1
    rdma/fabric4: 1
    rdma/fabric5: 1
    rdma/fabric6: 1
    rdma/fabric7: 1
  limits:
    amd.com/gpu: 8
    rdma/fabric0: 1
    rdma/fabric1: 1
    rdma/fabric2: 1
    rdma/fabric3: 1
    rdma/fabric4: 1
    rdma/fabric5: 1
    rdma/fabric6: 1
    rdma/fabric7: 1

For NVIDIA GPU nodes, replace amd.com/gpu: 8 with nvidia.com/gpu: 8.
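
Before scheduling a workload, you can check that a fabric-connected node actually advertises the RDMA resources. This is a sketch; replace <node-name> with the name of a GPU worker node in your cluster:

kubectl describe node <node-name> | grep rdma/fabric

The rdma/fabric0 through rdma/fabric7 resources should appear in the node's Capacity and Allocatable sections.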

Configure Multus CNI Plugin

In addition to the RDMA-related resources, you must make the fabric NICs fabric0, fabric1, ..., fabric7 available to your containers. To do this, configure a set of NetworkAttachmentDefinition resources that use the host-device CNI plugin to expose each NIC.

Create a manifest file that contains a NetworkAttachmentDefinition for each of the fabric NICs:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric0"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric1
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric1"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric2
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric2"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric3
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric3"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric4
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric4"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric5
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric5"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric6
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric6"
    }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric7
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "fabric7"
    }'

Install the resources in your desired namespace using the following command:

kubectl apply -f <your-manifest>.yaml --namespace=<your-namespace>
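
To confirm that the resources were created, list them in the namespace you applied them to, for example:

kubectl get network-attachment-definitions --namespace=<your-namespace>

The output should list roce-net-fabric0 through roce-net-fabric7.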

Next, make the fabric NICs available in your containers by adding an annotation to your Pod or Deployment manifest:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: >-
      roce-net-fabric0@fabric0,
      roce-net-fabric1@fabric1,
      roce-net-fabric2@fabric2,
      roce-net-fabric3@fabric3,
      roce-net-fabric4@fabric4,
      roce-net-fabric5@fabric5,
      roce-net-fabric6@fabric6,
      roce-net-fabric7@fabric7

Apply the updated manifest using kubectl apply.

You can also reference a NetworkAttachmentDefinition from another namespace by prefixing the resource name with its namespace in the annotation (for example, custom-namespace/roce-net-fabric0@fabric0). Each fabric NIC can only be attached to a single container at a time.

Once the fabric0, fabric1, ..., fabric7 NICs are available in the containers, high-speed networking using RDMA is enabled between the GPU nodes.
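
The following manifest ties the pieces together in a single Pod. It is a sketch only: the Pod name, image, and command are placeholders, it assumes AMD GPU nodes, and it assumes the NetworkAttachmentDefinition resources exist in the same namespace as the Pod. For NVIDIA GPU nodes, use nvidia.com/gpu instead of amd.com/gpu.

apiVersion: v1
kind: Pod
metadata:
  name: rdma-example   # placeholder name
  annotations:
    # Attach all eight fabric NICs via the NetworkAttachmentDefinitions created above.
    k8s.v1.cni.cncf.io/networks: >-
      roce-net-fabric0@fabric0,
      roce-net-fabric1@fabric1,
      roce-net-fabric2@fabric2,
      roce-net-fabric3@fabric3,
      roce-net-fabric4@fabric4,
      roce-net-fabric5@fabric5,
      roce-net-fabric6@fabric6,
      roce-net-fabric7@fabric7
spec:
  containers:
    - name: workload
      image: <your-image>   # placeholder: your AI/ML workload image
      command: ["sleep", "infinity"]
      resources:
        requests:
          amd.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
        limits:
          amd.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1

After the Pod is running, you can confirm that the fabric NICs were attached by listing the interfaces inside the container, for example with kubectl exec rdma-example -- ip link show (assuming the container image provides the ip tool).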
