
How to Configure NVLink on Machines

Machines provide high-performance computing for scaling AI applications.


Use NVLink to accelerate your training workloads. Learn more about NVLink.

Follow these steps to enable NVLink on your A100-80Gx8 Ubuntu machines. NVLink is not available on Windows or CentOS 7.

Script-based Setup

Download and run our helper script.

wget http://softupdate.paperspace.io/configure-nvlink.sh -O - | bash

Restart your machine.

After the reboot, running nvidia-smi topo -m should show every pair of GPUs connected with NVLink (NV12):

paperspace@pspe7ld5x:~$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	0-95		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	0-95		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	0-95		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	0-95		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	0-95		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (for example, QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Step-by-step Setup

Update your NVIDIA driver to the latest version (515 at the time of writing).

sudo apt update
sudo apt install nvidia-driver-515

Install NVIDIA Fabric Manager. The installed version must match your driver version.

sudo apt install cuda-drivers-fabricmanager-515
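
Because Fabric Manager must match the driver version, it can help to compare the two before rebooting. The following is a minimal sketch, not part of the official setup: the function names normalize_version and versions_match are hypothetical, and it assumes the package version looks like the driver version with a Debian revision suffix (for example 515.65.01-1).

```shell
# Strip an optional epoch ("1:") and the Debian revision ("-1") so that
# "515.65.01-1" and "515.65.01" compare as equal.
normalize_version() {
  local v=${1#*:}
  printf '%s\n' "${v%%-*}"
}

# Succeed only when both version strings normalize to the same value.
versions_match() {
  [ "$(normalize_version "$1")" = "$(normalize_version "$2")" ]
}

# Example usage on the machine (package name taken from the step above;
# that its version mirrors the driver version is an assumption):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   fm=$(dpkg-query -W -f='${Version}' cuda-drivers-fabricmanager-515)
#   versions_match "$drv" "$fm" && echo "versions match" || echo "version mismatch"
```

If the check reports a mismatch, reinstalling both packages from the same driver branch is the likely fix.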

Enable persistence mode for your GPUs.
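
The step above does not show a command. One common way to enable persistence mode is nvidia-smi -pm 1, sketched below together with a small check. The function names are hypothetical, and a setting made this way may not survive a reboot on its own (NVIDIA's nvidia-persistenced daemon is the usual long-term mechanism).

```shell
# Turn persistence mode on for every GPU (requires root).
enable_persistence_mode() {
  sudo nvidia-smi -pm 1
}

# Read one persistence state per line on stdin and succeed only if there
# is at least one line and every line says "Enabled".
all_persistence_enabled() {
  awk 'NF { n++; if ($1 != "Enabled") bad++ } END { exit (n == 0 || bad > 0) }'
}

# Example usage on the machine:
#   enable_persistence_mode
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader \
#     | all_persistence_enabled && echo "persistence mode is on for all GPUs"
```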

At this point your nvidia-smi output should look like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   32C    P0    54W / 400W |    136MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    50W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:00:0C.0 Off |                    0 |
| N/A   31C    P0    52W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Before the reboot, the output of nvidia-smi topo -m still shows every pair of GPUs connected through PHB (PCIe) rather than NVLink:

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	PHB	PHB	PHB	PHB	PHB	PHB	PHB	0-95		N/A
GPU1	PHB	 X 	PHB	PHB	PHB	PHB	PHB	PHB	0-95		N/A
GPU2	PHB	PHB	 X 	PHB	PHB	PHB	PHB	PHB	0-95		N/A
GPU3	PHB	PHB	PHB	 X 	PHB	PHB	PHB	PHB	0-95		N/A
GPU4	PHB	PHB	PHB	PHB	 X 	PHB	PHB	PHB	0-95		N/A
GPU5	PHB	PHB	PHB	PHB	PHB	 X 	PHB	PHB	0-95		N/A
GPU6	PHB	PHB	PHB	PHB	PHB	PHB	 X 	PHB	0-95		N/A
GPU7	PHB	PHB	PHB	PHB	PHB	PHB	PHB	 X 	0-95		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (for example, QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Restart your machine.

After the reboot, running nvidia-smi topo -m should show every pair of GPUs connected with NVLink (NV12):

paperspace@pspe7ld5x:~$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	0-95		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	0-95		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	0-95		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	0-95		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	0-95		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	0-95		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (for example, QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
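
To check the topology matrix without reading it by eye, the output can be piped through a small filter. This is a minimal sketch under stated assumptions: check_nvlink_topology is a hypothetical helper, and it assumes the layout shown above (a header row that starts with whitespace, then one row per GPU with the GPU columns first).

```shell
# Read `nvidia-smi topo -m` output on stdin. Exit 0 if every GPU-to-GPU
# cell is "X" or "NV<n>", exit 1 if any pair uses another link type,
# exit 2 if no GPU rows were found.
check_nvlink_topology() {
  awk '
    # Header row: starts with whitespace and names one column per GPU.
    /^[ \t]+GPU[0-9]+/ {
      ngpu = 0
      for (i = 1; i <= NF; i++)
        if ($i ~ /^GPU[0-9]+$/) col[++ngpu] = $i
      next
    }
    # Data row: GPU name first, then one cell per GPU column.
    /^GPU[0-9]+[ \t]/ && ngpu > 0 {
      rows++
      for (i = 1; i <= ngpu; i++) {
        cell = $(i + 1)
        if (cell != "X" && cell !~ /^NV[0-9]+$/) {
          printf "%s -> %s: %s (not NVLink)\n", $1, col[i], cell
          bad++
        }
      }
    }
    END {
      if (rows == 0) { print "no GPU rows found"; exit 2 }
      if (bad > 0) exit 1
      print "all " rows " GPUs are connected over NVLink"
    }
  '
}

# Example usage on the machine:
#   nvidia-smi topo -m | check_nvlink_topology
```

With the PHB matrix shown before the reboot, this sketch would list each non-NVLink pair and return a non-zero status; with the NV12 matrix it prints a single confirmation line.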