Machines are Linux and Windows virtual machines with persistent storage, GPU options, and free unlimited bandwidth. They’re designed for high-performance computing (HPC) workloads.
NVLink is a high-speed GPU interconnect developed by NVIDIA. NVLink improves data transfer speeds and scalability for high-performance computing tasks across multiple GPUs.
How you handle NVLink on your machine depends on the type of machine you’re using:
Machines using H100x8 GPUs come with NVLink enabled. You don’t need to enable NVLink manually.
Machines using H100x1 or A100-80Gx1 GPUs come with NVLink enabled. You need to disable NVLink in order to run CUDA.
You also need the NVIDIA drivers and the NVIDIA CUDA Toolkit installed as described in this article, but you do not need Fabric Manager.
SSH-only machines do not come with NVLink enabled. You can choose to manually enable NVLink.
As an alternative to manually enabling NVLink, which can be complex and error-prone, we recommend creating machines using the ML-in-a-Box template instead. ML-in-a-Box provides the data science stack needed for HPC tasks without the need to manually configure NVLink.
For improved performance on ML-in-a-Box machines, we recommend disabling the desktop environment. To do so, change the default startup target from a graphical interface to a non-graphical interface, and then reboot the machine:
sudo systemctl set-default multi-user.target
sudo reboot
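If you later want the desktop environment back, set the default startup target to the graphical interface again and reboot:
sudo systemctl set-default graphical.target
sudo reboot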
If your use case requires NVLink, you can manually enable it using the instructions below.
To enable NVLink, you must install the NVIDIA CUDA Toolkit, the NVIDIA CUDA drivers, and NVIDIA Fabric Manager, as described in the following steps.
Before installing the software necessary to enable NVLink, update the machine’s package index and packages to the latest versions and verify that the machine’s GPUs are compatible with NVLink.
First, connect to your machine and open a terminal. Then, update your machine’s packages.
sudo apt-get update && sudo apt-get upgrade -y
Next, identify your machine’s GPUs by listing the PCI devices on your machine and filtering by NVIDIA.
lspci | grep NVIDIA
The output displays the GPU model names:
00:05.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
...
Search NVIDIA’s website for the GPU model and confirm that NVLink is listed as a supported interconnect. For example, NVIDIA’s page on the A100 GPU includes a specifications table at the bottom of the page with a row for interconnects.
NVCC, the CUDA compiler driver, compiles CUDA code into executable programs. Installing the NVIDIA CUDA Toolkit lets you use NVCC and other CUDA tools to develop and run CUDA applications.
Use the CUDA Toolkit and driver compatibility table to find the right version for your machine.
Download the repository pinning file with the appropriate version and move it into the APT preferences directory, which handles package priorities. For example, on Ubuntu 22.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
Similarly, download and install NVIDIA’s CUDA repository Debian package, which contains the NVIDIA CUDA Toolkit. For example, for version 12.4 on Ubuntu 22.04:
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
Copy the repository’s GPG key to the machine’s keyring directory, which securely authenticates packages from the repository. For example, for version 12.4 on Ubuntu 22.04:
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
Update the machine’s package lists to incorporate the new changes and then install the NVIDIA CUDA Toolkit from the repository you added. For example, for version 12.4:
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
Once you have the NVIDIA CUDA Toolkit installed on your machine, you can verify the version of NVCC with nvcc --version. If you installed the toolkit but the nvcc command isn’t found, the toolkit may not be on the PATH of your machine.
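In that case, you can add the toolkit’s bin directory to your PATH for the current shell session. The paths below assume the default install location for version 12.4; adjust them for the version you installed:
# Assumes the default install location for CUDA 12.4; adjust for your version
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH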
NVSMI, the NVIDIA System Management Interface, monitors and manages NVIDIA GPU devices by providing access to GPU settings, configuration details, performance, and real-time statuses. It also shows how GPUs are interconnected, either through PCIe or NVLink.
The NVIDIA CUDA drivers, which are necessary to use NVLink, include NVSMI. Install the CUDA drivers on your machine:
sudo apt-get install -y cuda-drivers
You can confirm that NVSMI is also installed by running nvidia-smi, which outputs information about your machine’s GPUs.
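If you only need a quick summary rather than the full table, nvidia-smi also accepts query flags. For example, to print just the GPU names and driver version:
nvidia-smi --query-gpu=name,driver_version --format=csv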
NVIDIA Fabric Manager manages fabric resources such as NVLink and is essential on machines with complex GPU interconnects, handling tasks such as configuring and allocating NVLink connections.
Install Fabric Manager on your machine and start it:
sudo apt-get install cuda-drivers-fabricmanager-550 -y
sudo systemctl start nvidia-fabricmanager
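Optionally, you can also enable the service so that it starts automatically on boot, and check that it is running:
sudo systemctl enable nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager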
Once NVLink is enabled, you can test the connections between your machine’s GPUs and confirm that NVLink is functioning, then test the CUDA environment.
Use NVSMI to output information about your GPUs:
nvidia-smi
In the output, look for the NVLink GPU Peer-to-Peer Connectivity Matrix:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA H100 Off | 00000000:00:1A.0 Off | 0 |
| N/A 32C P0 40W / 300W | 0MiB / 40960MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| NVLink GPU Peer-to-Peer Connectivity Matrix |
| |
| GPU0 GPU1 |
| 0 1 |
| 0 X NV1 |
| 1 NV1 X |
+-----------------------------------------------------------------------------+
| NV1 Enabled |
+-----------------------------------------------------------------------------+
NVLink is enabled if you see “NV1 Enabled” or “NV2 Enabled”.
Next, use NVSMI to display the GPU topology, or how the GPUs are connected to each other.
nvidia-smi topo -m
The output displays a table with connectivity details between each GPU:
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity
GPU0 X NV1 NV1 NV2 0-15 N/A
GPU1 NV1 X NV1 NV2 0-15 N/A
...
The GPUs are interconnected with NVLink if you see “NV1” or “NV2” in the output.
Finally, check the status of each NVLink connection for each GPU.
nvidia-smi nvlink --status
This command outputs information about each NVLink connection, like its utilization and its active or inactive status.
GPU 0: NVIDIA H100
Link 0: 250 GB/s - Active
Link 1: 250 GB/s - Active
Link 2: 250 GB/s - Inactive
Link 3: 250 GB/s - Active
Link 4: 250 GB/s - Active
Link 5: 250 GB/s - Active
GPU 1: NVIDIA H100
Link 0: 250 GB/s - Active
Link 1: 250 GB/s - Active
Link 2: 250 GB/s - Active
Link 3: 250 GB/s - Active
Link 4: 250 GB/s - Inactive
Link 5: 250 GB/s - Active
...
If NVLink has an inactive status on all links, then there’s an issue with the configuration. To troubleshoot, repeat the above steps for installing the necessary software and testing the GPU and NVLink connections. If going through the steps again doesn’t fix the issue, reboot the machine with sudo reboot.
For further assistance, contact Paperspace support.
After verifying and enabling NVLink, test your machine’s CUDA environment using CUDA Samples, a collection of example programs created by NVIDIA. These samples are used to test and demonstrate CUDA Toolkit features, such as NVLink.
Clone the CUDA Samples repository.
git clone https://github.com/NVIDIA/cuda-samples
The deviceQuery sample is a utility that provides information about the CUDA devices on your machine. It verifies that the system recognizes the GPUs and displays their capabilities, such as NVLink connectivity.
Move to the deviceQuery directory and compile the sample using the provided Makefile:
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
You can alternatively compile the sample with nvcc directly, for example nvcc -I../../../Common -o deviceQuery deviceQuery.cpp.
Finally, run the compiled program:
./deviceQuery
This validates that the CUDA environment is set up correctly and NVLink is connecting the GPUs within your machine.
Device 0: "NVIDIA H100 PCIe"
...
NVLink capability: Supported
P2P Access between GPUs: Yes
If NVLink doesn’t appear in the deviceQuery output, then there is an issue with the hardware setup, the driver configuration, or the software configuration.
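As an additional, optional check, the same repository includes a p2pBandwidthLatencyTest sample that measures peer-to-peer bandwidth and latency between GPUs, which helps confirm that traffic actually moves over NVLink. The path below assumes the sample sits under Samples/5_Domain_Specific, as in recent versions of the repository; adjust it if your checkout differs.
# Path assumes the repository layout at the time of writing; adjust if needed
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest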
To disable NVLink on your machine, you need to disable it both at the system level and on the RAM disk (initrd).
First, create a backup of the GRUB configuration file with the current date for identification. You can restore from this file in the event of an error.
sudo cp /etc/default/grub /etc/default/grub.backup_$(date +%Y-%m-%d)
Next, open /etc/default/grub with a text editor and update the GRUB_CMDLINE_LINUX_DEFAULT value to disable NVLink.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvlink.disable=1"
Save and close the file, then update GRUB and reboot your machine.
sudo update-grub
sudo reboot
If your machine does not boot, use your GRUB backup file to restore the original configuration and try again.
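After a successful reboot, you can confirm that the parameter was applied by inspecting the kernel command line; the output should include nvlink.disable=1:
cat /proc/cmdline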
Update your RAM disk to ensure NVLink is disabled when you start up your machine.
Create a backup of the current initrd file with the current date for identification. You can restore from this file in the event of an error.
sudo cp /boot/initrd.img-$(uname -r) /boot/initrd.img-$(uname -r).backup_$(date +%Y-%m-%d)
Then, to disable NVLink, modify the initramfs configuration with one of the following options:
Create a modprobe configuration file that denylists NVIDIA NVLink modules:
echo "blacklist <module_name>" | sudo tee /etc/modprobe.d/nvlink-denylist.conf
This does not change initramfs itself but makes initramfs respect your denylist.
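As an illustration only, the following denylists the same modules that the example script in the next option removes; the exact module names to use depend on your driver installation:
echo "blacklist nvidia_nvlink" | sudo tee /etc/modprobe.d/nvlink-denylist.conf
echo "blacklist nvidia_uvm" | sudo tee -a /etc/modprobe.d/nvlink-denylist.conf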
Write an initramfs-tools script to disable NVLink. For example, create /etc/initramfs-tools/scripts/new-init/disable-nvlink.sh (creating the directory if it does not already exist), and enter a script like the following that disables specific NVIDIA kernel modules:
#!/bin/sh
# Unload NVLink-related NVIDIA kernel modules if they are loaded
modprobe -r nvidia_nvlink
modprobe -r nvidia_uvm
Save and close the file. Then, make the script executable and rebuild initrd for your running kernel.
sudo chmod +x /etc/initramfs-tools/scripts/new-init/disable-nvlink.sh
sudo update-initramfs -u
To disable NVLink across all Linux kernel versions on your machine (if you have a multi-boot environment or multiple kernel versions for testing or compatibility reasons), update all installed kernels:
sudo update-initramfs -u -k all
Finally, reboot your machine.
sudo reboot
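After the reboot, you can check which NVIDIA kernel modules are loaded to confirm the change took effect:
lsmod | grep nvidia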
If your RAM disk is corrupted after a reboot, restore your RAM disk using your backup file and reboot your machine.
sudo cp /boot/initrd.img-$(uname -r).backup_$(date +%Y-%m-%d) /boot/initrd.img-$(uname -r)
sudo reboot
After the reboot, you can try again.