# How to Configure Networking for Multi-Node GPU Worker Nodes

DigitalOcean Kubernetes (DOKS) is a Kubernetes service with a fully managed control plane, high availability, and autoscaling. DOKS integrates with standard Kubernetes toolchains and DigitalOcean's load balancers, volumes, CPU and GPU Droplets, API, and CLI.

Multi-node GPU clusters can only be created in multiples of 8 GPUs and are available [by contract only](https://www.digitalocean.com/company/contact/sales). For more information on supported GPUs, see [GPU Worker Nodes](https://docs.digitalocean.com/products/kubernetes/details/supported-gpus/index.html.md).

In a multi-node configuration, the 8-GPU nodes are connected via a dedicated high-speed networking fabric in the DOKS cluster. The fabric is exposed on worker nodes through eight network interface controllers (NICs) named `fabric0, fabric1, …, fabric7`, which exist alongside the regular `eth0` and `eth1` interfaces. The `eth0` interface provides public internet connectivity, and `eth1` provides private connectivity to other nodes in the same VPC network.

The fabric NICs enable AI/ML workloads to exchange data with very low latency and high throughput. To achieve high networking performance, we recommend using the Remote Direct Memory Access (RDMA) networking protocol for communication between the GPU nodes through the fabric NICs. RDMA bypasses the CPU and operating system kernel entirely during data transfer.

Additional plugins are required to enable the high-speed fabric for multi-node GPU networking. This guide covers the required components and how to configure them.

## Required Plugins

To use the high-speed fabric with container-based workloads, the following Kubernetes plugins must be available on clusters with AMD or NVIDIA GPUs:

- [Mellanox k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin): This plugin is automatically installed in your DOKS cluster when you add a node pool with a fabric-connected slug. It exposes RDMA-related resources as Kubernetes resources named `rdma/fabric0, rdma/fabric1, rdma/fabric2, …, rdma/fabric7`. You can [manage these resources using resource requests and limits](#manage-rdma-related-resources) in your manifests.

- [Multus CNI plugin](https://github.com/k8snetworkplumbingwg/multus-cni): You must install this plugin manually. It moves the NICs `fabric0, fabric1, ..., fabric7` into the container namespace via the `host-device` plugin. To install the plugin, run the following command:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
  ```

  **Note**: Regular public and private communication in the cluster via `eth0` and `eth1` is not affected by the installation of the Multus CNI and continues to use [Cilium](https://github.com/cilium/cilium/).

After installing the CNI plugin, create `NetworkAttachmentDefinition` resources for the fabric NICs as described in the [Configure Multus CNI Plugin](#configure-multus-cni-plugin) section below.
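You can also confirm that the k8s-rdma-shared-dev-plugin is advertising the fabric resources before scheduling workloads. The following is a minimal check, assuming `kubectl` access to the cluster; the node name is a placeholder for one of your GPU worker nodes:

```shell
# List the RDMA resources a GPU worker node advertises. Each of
# rdma/fabric0 through rdma/fabric7 should appear under Capacity
# and Allocatable with a non-zero count.
kubectl describe node your-gpu-node-name | grep rdma/
```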
## Manage RDMA-Related Resources

Expose the RDMA-related resources managed by the Mellanox k8s-rdma-shared-dev-plugin to your workloads. To do this for AMD GPU nodes, add the following [resource requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) to your Pod or Deployment manifest:

```yaml
resources:
  requests:
    amd.com/gpu: 8
    rdma/fabric0: 1
    rdma/fabric1: 1
    rdma/fabric2: 1
    rdma/fabric3: 1
    rdma/fabric4: 1
    rdma/fabric5: 1
    rdma/fabric6: 1
    rdma/fabric7: 1
  limits:
    amd.com/gpu: 8
    rdma/fabric0: 1
    rdma/fabric1: 1
    rdma/fabric2: 1
    rdma/fabric3: 1
    rdma/fabric4: 1
    rdma/fabric5: 1
    rdma/fabric6: 1
    rdma/fabric7: 1
```

For NVIDIA GPU nodes, replace `amd.com/gpu: 8` with `nvidia.com/gpu: 8`.

## Configure Multus CNI Plugin

In addition to the RDMA-related resources, you must make the fabric NICs `fabric0, fabric1, ..., fabric7` available to your containers. To do this, configure a set of `NetworkAttachmentDefinition` resources that use the `host-device` CNI plugin to expose each NIC.

Create a configuration file that contains a `NetworkAttachmentDefinition` for each fabric NIC:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric0
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric0"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric1
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric1"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric2
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric2"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric3
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric3"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric4
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric4"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric5
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric5"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric6
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric6"
  }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-net-fabric7
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "host-device",
    "device": "fabric7"
  }'
```

Install the resources in your desired namespace using the following command, replacing the placeholders with your file name and namespace:

```shell
kubectl apply -f <config-file>.yaml --namespace=<namespace>
```

Next, make the fabric NICs available in your containers by adding an annotation to your Pod or Deployment manifest:

```yaml
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: >-
      roce-net-fabric0@fabric0,
      roce-net-fabric1@fabric1,
      roce-net-fabric2@fabric2,
      roce-net-fabric3@fabric3,
      roce-net-fabric4@fabric4,
      roce-net-fabric5@fabric5,
      roce-net-fabric6@fabric6,
      roce-net-fabric7@fabric7
```

Use `kubectl apply` to apply the updates.

You can also reference a `NetworkAttachmentDefinition` in another namespace by prefixing its name with the namespace in the annotation (for example, `custom-namespace/roce-net-fabric0@fabric0`). Each fabric NIC can only be attached to a single container at a time.

Once the `fabric0, fabric1, ..., fabric7` NICs are available in the containers, high-speed networking using RDMA is enabled between the GPU nodes.
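To see how the pieces fit together, here is a minimal end-to-end Pod sketch rather than a definitive configuration: the Pod name, image, and command are placeholder assumptions, and it assumes NVIDIA GPU nodes with the `roce-net-fabric0` through `roce-net-fabric7` definitions created in the same namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-fabric-test   # placeholder name
  annotations:
    # Attach all eight fabric NICs through the NetworkAttachmentDefinitions above.
    k8s.v1.cni.cncf.io/networks: >-
      roce-net-fabric0@fabric0,
      roce-net-fabric1@fabric1,
      roce-net-fabric2@fabric2,
      roce-net-fabric3@fabric3,
      roce-net-fabric4@fabric4,
      roce-net-fabric5@fabric5,
      roce-net-fabric6@fabric6,
      roce-net-fabric7@fabric7
spec:
  containers:
    - name: workload
      image: ubuntu:24.04              # placeholder; use your AI/ML workload image
      command: ["sleep", "infinity"]   # placeholder; keeps the Pod running for inspection
      resources:
        requests:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
        limits:
          nvidia.com/gpu: 8
          rdma/fabric0: 1
          rdma/fabric1: 1
          rdma/fabric2: 1
          rdma/fabric3: 1
          rdma/fabric4: 1
          rdma/fabric5: 1
          rdma/fabric6: 1
          rdma/fabric7: 1
```

Once the Pod is running, `kubectl exec rdma-fabric-test -- ip -br link` (assuming the image provides the `ip` tool from `iproute2`) should list `fabric0` through `fabric7` alongside the default Pod interface.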