How to Enable or Disable NVLink

Machines are Linux and Windows virtual machines with persistent storage, GPU options, and free unlimited bandwidth. They’re designed for high-performance computing (HPC) workloads.


NVLink is a high-speed GPU interconnect developed by NVIDIA. NVLink improves data transfer speeds and scalability for high-performance computing tasks across multiple GPUs.

How you handle NVLink on your machine depends on the type of machine you’re using:

  • Machines using H100x8 GPUs come with NVLink enabled. You don’t need to enable NVLink manually.

  • Machines using H100x1 or A100-80Gx1 GPUs come with NVLink enabled. You need to disable NVLink in order to run CUDA.

    You also need NVIDIA drivers and the NVIDIA CUDA toolkit installed as described in this article, but do not need Fabric Manager.

  • SSH-only machines do not come with NVLink enabled. You can choose to manually enable NVLink.

Tip

As an alternative to manually enabling NVLink, which can be complex and error-prone, we recommend creating machines using the ML-in-a-Box template instead. ML-in-a-Box provides the data science stack needed for HPC tasks without the need to manually configure NVLink.

For improved performance on ML-in-a-Box machines, we recommend disabling the desktop environment. To do so, change the default startup target from a graphical interface to a non-graphical interface, and then reboot the machine:

sudo systemctl set-default multi-user.target 
sudo reboot
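
After the machine comes back up, you can confirm the change took effect by checking the default startup target, which should now report multi-user.target:

systemctl get-default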

If your use case requires NVLink, you can manually enable it using the instructions below.

To enable NVLink, you must:

  1. Update your machine’s packages and verify the compatibility of its GPUs.
  2. Install the NVIDIA CUDA Toolkit.
  3. Install CUDA Drivers and NVSMI.
  4. Install NVIDIA Fabric Manager.

Update Packages and Verify GPU Compatibility

Before installing the software necessary to enable NVLink, update the machine’s package index and packages to the latest versions and verify that the machine’s GPUs are compatible with NVLink.

First, connect to your machine and open a terminal. Then, update your machine’s packages.

sudo apt-get update && sudo apt-get upgrade -y

Next, identify your machine’s GPUs by listing the PCI devices on your machine and filtering by NVIDIA.

lspci | grep NVIDIA

The output displays the GPU model names:

00:05.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
...

Search NVIDIA’s website for the GPU model and confirm that NVLink is listed as a supported interconnect. For example, NVIDIA’s page on the A100 GPU includes a specifications table at the bottom of the page with a row for interconnects.

Install the NVIDIA CUDA Toolkit

NVCC, the CUDA compiler driver, compiles CUDA code into executable programs. Installing the NVIDIA CUDA Toolkit lets you use NVCC and other CUDA tools to develop and run CUDA applications.

Use the CUDA Toolkit and driver compatibility table to find the right version for your machine.
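
For example, before picking a toolkit version from the table, you can check which Ubuntu release the machine is running and, if a driver is already installed, which driver version it reports:

lsb_release -rs
nvidia-smi --query-gpu=driver_version --format=csv,noheader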

Download the repository pinning file with the appropriate version and move it into the APT preferences directory, which handles package priorities. For example, on Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

Similarly, download and install NVIDIA’s CUDA repository Debian package, which contains the NVIDIA CUDA Toolkit. For example, with version 12.4 on Ubuntu 22.04:

wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb

Copy the repository’s GPG key to the machine’s keyring directory, which securely authenticates packages from the repository. For example, for version 12.4 on Ubuntu 22.04:

sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/

Update the machine’s package lists to incorporate the new changes and then install the NVIDIA CUDA Toolkit from the repository you added. For example, for version 12.4:

sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4

Once you have the NVIDIA CUDA Toolkit installed on your machine, you can verify the version of NVCC with nvcc --version.
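
For example:

nvcc --version

The last line of the output should report a release matching the toolkit version you installed, similar to Cuda compilation tools, release 12.4.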

If you installed the toolkit but the nvcc command isn’t found, the toolkit may not be on the PATH of your machine.


To add the toolkit to your PATH, add the following lines to the ~/.profile file:

export PATH=/etc/alternatives/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/etc/alternatives/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Save the file, and then apply the changes:

source ~/.profile

Run the nvcc --version command again to confirm the fix.
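
You can also confirm that the shell now resolves nvcc from the toolkit’s directory:

which nvcc

The path should point into /etc/alternatives/cuda/bin, matching the PATH entry you added above.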

Install CUDA Drivers and NVSMI

NVSMI, the NVIDIA System Management Interface, monitors and manages NVIDIA GPU devices by providing access to GPU settings, configuration details, performance, and real-time statuses. It also shows how GPUs are interconnected, either through PCIe or NVLink.

The NVIDIA CUDA drivers, which are necessary to use NVLink, include NVSMI. Install the CUDA drivers on your machine:

sudo apt-get install -y cuda-drivers

You can confirm that NVSMI is also installed by running nvidia-smi, which outputs information about your machine’s GPUs.
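
For a shorter summary than the full nvidia-smi table, you can also query just the GPU names and driver version:

nvidia-smi --query-gpu=name,driver_version --format=csv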

Install NVIDIA Fabric Manager

NVIDIA Fabric Manager configures and allocates fabric resources such as NVLink connections, and is essential on machines with complex GPU interconnects.

Install Fabric Manager on your machine and start it:

sudo apt-get install cuda-drivers-fabricmanager-550 -y
sudo systemctl start nvidia-fabricmanager
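
You can verify that the service is running, and optionally enable it so that it starts automatically after reboots:

systemctl status nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager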

Once NVLink is enabled, you can test the connections between your machine’s GPUs and confirm that NVLink is functioning, then test the CUDA environment.

Check the Connectivity Matrix

Use NVSMI to output information about your GPUs:

nvidia-smi

In the output, look for the NVLink GPU Peer-to-Peer Connectivity Matrix:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100      Off     | 00000000:00:1A.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| NVLink GPU Peer-to-Peer Connectivity Matrix                                 |
|                                                                             |
|     GPU0    GPU1                                                            |
|     0       1                                                               |
| 0   X       NV1                                                             |
| 1   NV1     X                                                               |
+-----------------------------------------------------------------------------+
|   NV1  Enabled                                                              |
+-----------------------------------------------------------------------------+

NVLink is enabled if you see “NV1 Enabled” or “NV2 Enabled”.

Check the GPU Topology

Next, use NVSMI to display the GPU topology, or how the GPUs are connected to each other.

nvidia-smi topo -m

The output displays a table with connectivity details between each GPU:

        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     0-15            N/A
GPU1    NV1      X      NV1     NV2     0-15            N/A
...

The GPUs are interconnected with NVLink if you see “NV1” or “NV2” in the output.

Finally, check the status of each NVLink connection for each GPU.

nvidia-smi nvlink --status

This command outputs information about each NVLink connection, like its utilization and its active or inactive status.

GPU 0: NVIDIA H100
    Link 0: 250 GB/s - Active
    Link 1: 250 GB/s - Active
    Link 2: 250 GB/s - Inactive
    Link 3: 250 GB/s - Active
    Link 4: 250 GB/s - Active
    Link 5: 250 GB/s - Active

GPU 1: NVIDIA H100
    Link 0: 250 GB/s - Active
    Link 1: 250 GB/s - Active
    Link 2: 250 GB/s - Active
    Link 3: 250 GB/s - Active
    Link 4: 250 GB/s - Inactive
    Link 5: 250 GB/s - Active
...

If NVLink has an inactive status on all links, then there’s an issue with the configuration. To troubleshoot, repeat the above steps for installing the necessary software and testing the GPU and NVLink connections. If going through the steps again doesn’t fix the issue, reboot the machine with sudo reboot.
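
As a quick check after repeating the steps or rebooting, you can count how many links report each state. This assumes link states are reported as Active or Inactive, as in the example output above:

nvidia-smi nvlink --status | grep -c "Active"
nvidia-smi nvlink --status | grep -c "Inactive"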

For further assistance, contact Paperspace support.

Test the CUDA Environment

After verifying and enabling NVLink, test your machine’s CUDA environment using the CUDA samples, a collection of example programs created by NVIDIA. These samples are used to configure and test CUDA Toolkit features, such as NVLink.

Clone the CUDA Samples repository.

git clone https://github.com/NVIDIA/cuda-samples

The deviceQuery sample is a utility that provides information about the CUDA devices on your machine. It verifies that the system recognizes the GPUs and displays their capabilities, such as NVLink connectivity.

Move to the deviceQuery directory and compile the sample using the provided Makefile:

cd cuda-samples/Samples/1_Utilities/deviceQuery
make

You can alternatively compile the sample with nvcc -o deviceQuery deviceQuery.cpp.

Finally, run the compiled program:

./deviceQuery

This validates that the CUDA environment is set up correctly and NVLink is connecting the GPUs within your machine.

Device 0: "NVIDIA H100 PCIe"
...
  NVLink capability:                             Supported
    P2P Access between GPUs:                     Yes

If NVLink doesn’t appear in the deviceQuery output, then there is an issue with the hardware setup, the driver configuration, or the software configuration.
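
If you want to measure the NVLink connections directly, the cuda-samples repository also includes a peer-to-peer bandwidth and latency test. From the directory where you cloned the repository, compile and run it; the path below reflects the current repository layout and may differ between releases:

cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest

The reported peer-to-peer bandwidth between GPU pairs should be substantially higher over NVLink than over PCIe.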

Disable NVLink

To disable NVLink on your machine, you need to disable it both at the system level and on the RAM disk (initrd).

At the System Level

First, create a backup of the GRUB configuration file with the current date for identification. You can restore from this file in the event of an error.

sudo cp /etc/default/grub /etc/default/grub.backup_$(date +%Y-%m-%d)

Next, open /etc/default/grub with a text editor and update the GRUB_CMDLINE_LINUX_DEFAULT value to disable NVLink.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvlink.disable=1"

Save and close the file, then update GRUB and reboot your machine.

sudo update-grub
sudo reboot
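
After the machine reboots, you can confirm that the new parameter was applied by inspecting the kernel command line, which should now include nvlink.disable=1:

cat /proc/cmdline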

If your machine does not boot, use your GRUB backup file to restore the original configuration and try again.


First, print a list of your machine’s disk partitions.

sudo fdisk -l

Locate the partition that contains the Linux filesystem. Its name follows a /dev/sdXn pattern. For example, if the device name is /dev/sda1, sda represents the disk and 1 is the partition number.
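
If the fdisk output is difficult to read, lsblk can also help you identify the root partition by listing each block device with its filesystem type and mount point:

lsblk -f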

Mount the partition where your Ubuntu system is installed, replacing /dev/sdXn with the appropriate system partition from the previous step.

sudo mount /dev/sdXn /mnt

Mount the necessary directories.

sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys
sudo mount --bind /run /mnt/run

Change the root directory to your system’s partition.

sudo chroot /mnt

This creates an environment, often called a chroot jail, where programs cannot access files outside of the directory specified.

Restore the GRUB configuration file using your backup file.

sudo cp $(ls -Art /etc/default/grub.backup* | tail -n 1) /etc/default/grub

The * in the command matches filenames that start with grub.backup, such as the filename with the backup’s creation date, in the /etc/default directory.

Update your GRUB configuration file, exit the chroot jail, and reboot your machine.

update-grub
exit
sudo reboot

On the RAM Disk

Update your RAM disk to ensure NVLink is disabled when you start up your machine.

Create a backup of the current initrd file with the current date for identification. You can restore from this file in the event of an error.

sudo cp /boot/initrd.img-$(uname -r) /boot/initrd.img-$(uname -r).backup_$(date +%Y-%m-%d)

Then, to disable NVLink, modify the initramfs configurations with one of the following options:

  • Create a modprobe configuration file that denylists NVIDIA NVLink modules:

    echo "blacklist <module_name>" | sudo tee /etc/modprobe.d/nvlink-denylist.conf
    

    This does not modify the initramfs image itself, but the boot process respects your denylist when loading the modules.

  • Write an initramfs-tools script to disable NVLink. For example, create /etc/initramfs-tools/scripts/new-init/disable-nvlink.sh, and enter a script like the following that disables specific NVIDIA kernel modules:

    #!/bin/sh
    
    modprobe -r nvidia_nvlink
    modprobe -r nvidia_uvm
    

    Save and close the file. Then, make the script executable and rebuild initrd for your running kernel.

    sudo chmod +x /etc/initramfs-tools/scripts/new-init/disable-nvlink.sh
    sudo update-initramfs -u
    

    To disable NVLink across all Linux kernel versions on your machine (if you have a multi-boot environment or multiple kernel versions for testing or compatibility reasons), update all installed kernels:

    sudo update-initramfs -u -k all
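
    After rebuilding, you can confirm that the initramfs images were regenerated by checking their timestamps in /boot (this assumes Ubuntu’s default /boot layout):

    ls -lh /boot/initrd.img-*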
    

Finally, reboot your machine.

sudo reboot

If your RAM disk is corrupted after a reboot, restore your RAM disk using your backup file and reboot your machine. If you created the backup on an earlier date, replace the date suffix in the command with the date in your backup file’s name.

sudo cp /boot/initrd.img-$(uname -r).backup_$(date +%Y-%m-%d) /boot/initrd.img-$(uname -r)
sudo reboot

After the reboot, you can try again.