How to Configure NVLink on Machines

Machines provide high-performance computing for scaling AI applications.


For non-ML-in-a-Box or terminal/SSH-only (headless) machines, configuring NVLink can speed up data transfer between GPUs, improve scalability, reduce latency, and optimize resource utilization, all of which are important for high-performance computing tasks.

Note

ML-in-a-Box provides the data science stack needed for high-performance computing tasks out of the box and avoids manual configuration that could lead to issues on the machine. We strongly recommend that you run ML-in-a-Box instead of configuring NVLink on non-ML-in-a-Box or headless machines. If you want, you can disable desktop streaming from ML-in-a-Box using the following command:

sudo systemctl set-default multi-user.target 

Then, reboot your system.

sudo reboot

Prerequisites

NVIDIA CUDA Compiler Driver (NVCC) and NVIDIA System Management Interface (NVSMI) are essential for an NVLink connection. Before configuring NVLink, identify the GPUs on the machine and check whether both tools are installed.

Identify GPUs

Identifying the GPUs on the machine and checking whether they have the proper drivers loaded verifies that those GPUs are compatible with an NVLink connection. Review the details about all the PCI buses and devices on the machine with the lspci command. The following output shows a list of PCI devices, including multiple NVIDIA A100 GPUs, on a Paperspace machine.

paperspace@pstc7tvpw3ml:~$ lspci
...
00:05.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:07.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:08.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:09.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0a.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0b.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
00:0c.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 80GB] (rev a1)
...

Alternatively, you can use lspci | grep NVIDIA, which specifically identifies NVIDIA GPUs, and lspci -v, which provides verbose information about each PCI device.
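To confirm the GPU count at a glance, the filtered lspci output can be counted. The following is a minimal sketch, not an official tool: the function name count_nvidia_gpus is hypothetical, and it assumes each GPU appears as a "3D controller" or "VGA compatible controller" line that contains "NVIDIA", as in the output above.

```shell
# Count NVIDIA GPU entries in lspci-style output read from stdin.
# Assumption: GPUs are listed as "3D controller" or "VGA compatible controller".
count_nvidia_gpus() {
    grep -i 'NVIDIA' | grep -c -E '3D controller|VGA compatible controller'
}

# Usage: runs only when lspci is available on the machine.
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA GPUs found: $(lspci | count_nvidia_gpus)"
fi
```

On the example machine above, this would report eight GPUs.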

Verify If NVIDIA CUDA Compiler Driver Is Installed

The NVIDIA CUDA Toolkit includes NVCC, which compiles CUDA code into executable programs. Check the current NVCC version on the machine with the nvcc --version command. Knowing which version is running on the machine lets you confirm that it is compatible with an NVLink connection.

paperspace@pstc7tvpw3ml:~$ nvcc --version
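If you only need the release number, for example to compare it against a compatibility table, it can be pulled out of the nvcc --version output. This is a small sketch under one assumption: the output contains a line of the form "Cuda compilation tools, release 12.4, V12.4.131". The function name nvcc_release is hypothetical.

```shell
# Extract the release number (for example "12.4") from `nvcc --version` output on stdin.
# Assumption: the output has a line like "Cuda compilation tools, release 12.4, V12.4.131".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

if command -v nvcc >/dev/null 2>&1; then
    echo "CUDA Toolkit release: $(nvcc --version | nvcc_release)"
else
    echo "nvcc not found on PATH"
fi
```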

If NVCC is available on the machine, then you can skip to verifying the NVIDIA CUDA Drivers installation.

If NVCC is not found on the machine, either the NVIDIA CUDA Toolkit isn’t installed or it isn’t on the machine’s PATH. If the toolkit isn’t installed, follow the Install NVIDIA CUDA Toolkit instructions in this guide.
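The two cases can be told apart with a quick check. The sketch below assumes the toolkit, when installed, exposes nvcc under /etc/alternatives/cuda/bin (the location this guide adds to PATH). The function name diagnose_nvcc is hypothetical, and other installations may use a different directory, which you can pass as an argument.

```shell
# Report whether nvcc is on PATH, installed but not on PATH, or missing.
# The directory argument defaults to the location used in this guide.
diagnose_nvcc() {
    cuda_bin="${1:-/etc/alternatives/cuda/bin}"
    if command -v nvcc >/dev/null 2>&1; then
        echo "on-path"
    elif [ -x "$cuda_bin/nvcc" ]; then
        echo "installed-not-on-path"
    else
        echo "not-installed"
    fi
}

diagnose_nvcc
```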

If the NVIDIA CUDA Toolkit isn’t on the PATH of the machine, then type the following commands to add the toolkit to the machine’s PATH.

export PATH=/etc/alternatives/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/etc/alternatives/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

To permanently add the toolkit to the machine’s PATH, open a terminal, and navigate to the machine’s home directory. List all the files by running the ls -a command. Then, use a text editor of your choice to add the commands to the .profile file. For example, nano .profile.

Save and exit the .profile file, and apply the change by either restarting the terminal or sourcing the .profile file with the source ~/.profile command.
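As an alternative to editing the file by hand, the two export lines can be appended from the shell. This is a sketch, assuming you want each line added only once even if the step is repeated. The helper name append_once is hypothetical, and the usage lines are commented out so nothing is modified until you choose to run them.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (uncomment to modify ~/.profile). Single quotes keep the variables
# unexpanded so they are evaluated at login, not now.
# append_once 'export PATH=/etc/alternatives/cuda/bin${PATH:+:${PATH}}' "$HOME/.profile"
# append_once 'export LD_LIBRARY_PATH=/etc/alternatives/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' "$HOME/.profile"
```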

Verify If NVIDIA System Management Interface Is Installed

The NVIDIA CUDA Drivers include NVSMI, which monitors and manages NVIDIA GPU devices by providing access to GPU settings and configuration details, GPU performance, and real-time status. It also shows how the GPUs are interconnected, which is either through PCIe or NVLink.

Use the nvidia-smi command in your terminal to see whether the GPUs on the machine are compatible with an NVLink connection and whether the hardware is up to date and functioning correctly.

paperspace@pstc7tvpw3ml:~$ nvidia-smi

If NVSMI is available on the machine, then you can skip to installing the fabric manager. If NVSMI isn’t found on the machine, install the NVIDIA CUDA Drivers to install NVSMI on the machine.
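For a compact view of the GPUs and the driver version, nvidia-smi also supports a query mode. The sketch below wraps it in a hypothetical list_gpus function that falls back to a message when nvidia-smi is missing.

```shell
# Print one CSV line per GPU (index, name, driver version), or a hint if nvidia-smi is absent.
list_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=index,name,driver_version --format=csv,noheader
    else
        echo "nvidia-smi not found; install the NVIDIA CUDA Drivers first"
    fi
}

list_gpus
```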

Update Machine

After identifying the GPUs and verifying that NVCC and NVSMI are installed, ensure that the machine is up to date. Switch to the root user with the sudo su - command. Then, update the machine’s system software by running the apt-get update && apt-get upgrade -y command. If you stay in your regular user shell instead, prefix both commands with sudo: sudo apt-get update && sudo apt-get upgrade -y.

Install NVIDIA CUDA Toolkit

Installing the NVIDIA CUDA Toolkit allows you to use NVCC and other CUDA tools for developing and running CUDA applications. To find the correct NVIDIA CUDA Toolkit for the drivers on the machine, see NVIDIA’s CUDA Toolkit and driver compatibility table. For example, if the machine uses Ubuntu 22.04, download NVIDIA’s Ubuntu 22.04 CUDA Toolkit repository pinning file:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin

Move the pinning file to the APT preferences directory, which handles package priorities.

sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600

Download NVIDIA’s Ubuntu 22.04 CUDA repository Debian (.deb) package.

wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb

Install the CUDA repository package, which contains the NVIDIA CUDA Toolkit and adds the repository to the APT system.

sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb

Copy the repository’s GPG key to the machine’s keyring directory. This is necessary for authenticating packages from the repository securely.

sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/

Update the machine’s package lists so that APT picks up the newly added repository:

sudo apt-get update

Lastly, install the NVIDIA CUDA Toolkit 12.4 from the repository you added earlier. This command ensures that the toolkit is installed correctly with all the required configurations and priorities set.

sudo apt-get -y install cuda-toolkit-12-4
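The repository URLs above embed the Ubuntu release as ubuntu2204. On a different Ubuntu release, the matching identifier can be derived from /etc/os-release. The sketch below assumes NVIDIA's repository path follows the same ubuntu<version> pattern for that release, which you should confirm on NVIDIA's download page. The function name cuda_repo_id is hypothetical.

```shell
# Build the CUDA repository identifier (for example "ubuntu2204") from an Ubuntu VERSION_ID.
cuda_repo_id() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Assumption: /etc/os-release defines ID and VERSION_ID, as it does on Ubuntu.
if [ -r /etc/os-release ]; then
    os_id="$(. /etc/os-release && printf '%s' "${ID:-}")"
    version_id="$(. /etc/os-release && printf '%s' "${VERSION_ID:-}")"
    if [ "$os_id" = "ubuntu" ]; then
        repo_id="$(cuda_repo_id "$version_id")"
        # Assumed URL pattern, mirroring the Ubuntu 22.04 example in this guide.
        echo "https://developer.download.nvidia.com/compute/cuda/repos/${repo_id}/x86_64/cuda-${repo_id}.pin"
    fi
fi
```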

Install NVIDIA CUDA Drivers

Install the CUDA drivers on the machine with the cuda-drivers package, which ensures that the GPUs on the machine have the proper drivers to use an NVLink connection.

sudo apt-get install -y cuda-drivers

Install NVIDIA Fabric Manager

NVIDIA Fabric Manager manages fabric resources such as NVLink and is important for setups involving complex GPU interconnects such as configuring and allocating NVLink connections. Install the CUDA driver’s Fabric Manager 550 on the machine.

sudo apt-get install cuda-drivers-fabricmanager-550 -y
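The 550 suffix in the package name corresponds to the driver branch installed in this guide. As a sketch for other driver versions, the suffix can be derived from the installed driver. It assumes the package keeps the cuda-drivers-fabricmanager-<branch> naming, which this guide only confirms for 550. The function name fabricmanager_package is hypothetical.

```shell
# Derive the Fabric Manager package name from a driver version string such as "550.54.15".
# Assumption: packages are named cuda-drivers-fabricmanager-<major driver branch>.
fabricmanager_package() {
    printf 'cuda-drivers-fabricmanager-%s\n' "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    driver_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    echo "Matching package: $(fabricmanager_package "$driver_version")"
fi
```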

Start the NVIDIA Fabric Manager which is required to manage and optimize NVLink fabric resources.

sudo systemctl start nvidia-fabricmanager
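Starting the service this way does not make it start again after a reboot. A minimal sketch for enabling it at boot and checking its state follows. The enable command is left commented out because it changes system configuration, and the function name fabricmanager_state is hypothetical.

```shell
# Enable the service at boot and start it immediately (requires root privileges).
# sudo systemctl enable --now nvidia-fabricmanager

# Print the service state ("active", "inactive", "failed", ...) or "unknown"
# when systemctl is unavailable or cannot report a state.
fabricmanager_state() {
    state=""
    if command -v systemctl >/dev/null 2>&1; then
        state="$(systemctl is-active nvidia-fabricmanager 2>/dev/null)" || true
    fi
    echo "${state:-unknown}"
}

echo "nvidia-fabricmanager: $(fabricmanager_state)"
```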

After configuring NVLink, verify the connections between GPUs and whether NVLink is active and functioning properly.

View how all the GPUs are connected and if they are working as expected using the nvidia-smi command. The command also shows connectivity information such as whether NVLink is enabled and connecting the GPUs.

Review the topology of the machine to see how the GPUs are interconnected by running the nvidia-smi topo -m command.
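In the topology matrix, an entry of the form NV# marks a GPU pair connected over # bonded NVLinks, while entries such as PIX, PHB, or SYS mark PCIe or system paths. The sketch below checks the matrix for at least one NV# entry. The function name topology_has_nvlink is hypothetical.

```shell
# Succeed when `nvidia-smi topo -m` output on stdin contains at least one NV<number> entry.
topology_has_nvlink() {
    grep -q -E '(^|[[:space:]])NV[0-9]+([[:space:]]|$)'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi topo -m | topology_has_nvlink; then
        echo "NVLink connections reported"
    else
        echo "No NVLink connections reported"
    fi
fi
```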

Check the status of each NVLink connection for each GPU. Run the nvidia-smi nvlink --status command to see the information about each NVLink including the utilization and active or inactive status.
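On machines with many GPUs, the status output is long, so a summary can help. The sketch below counts links and inactive links. It assumes each link is printed as "Link <n>: <speed>" when active and "Link <n>: <inactive>" when inactive, which may differ between driver versions. The function name summarize_nvlink is hypothetical.

```shell
# Summarize NVLink link states from `nvidia-smi nvlink --status` output on stdin.
# Assumption: one "Link <n>: ..." line per link, containing "inactive" when the link is down.
summarize_nvlink() {
    awk '
        /Link [0-9]+:/ { total++; if ($0 ~ /inactive/) inactive++ }
        END { printf "links=%d inactive=%d\n", total, inactive }
    '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi nvlink --status | summarize_nvlink
fi
```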

The NVLink connection isn’t configured correctly if NVLink shows as disabled or inactive. To troubleshoot, check that all the tools are installed and up to date, and then test the GPUs and the NVLink connection again. If these steps don’t fix the issue, then reboot the machine with sudo reboot. If you need further assistance, contact support.

Test Using CUDA Samples

After verifying the basic installation and functionality of the machine’s CUDA environment, test the environment further with CUDA samples. CUDA samples are a collection of examples made by NVIDIA and are used to configure and test the CUDA Toolkit features such as NVLink. Clone the CUDA Samples repository.

git clone https://github.com/NVIDIA/cuda-samples

Navigate to the deviceQuery sample which is a utility that provides information about the CUDA devices on the machine. It is used to verify that the system recognizes the GPUs and to display their capabilities such as NVLink.

cd cuda-samples/Samples/1_Utilities/deviceQuery

Compile the deviceQuery sample by running make in the sample directory. The Makefile there contains the instructions for compiling deviceQuery. Alternatively, you can compile the sample directly with nvcc -I../../../Common -o deviceQuery deviceQuery.cpp (in the cuda-samples repository, the source file is deviceQuery.cpp and it includes helper headers from the Common directory).

Lastly, run the compiled deviceQuery program with the ./deviceQuery command, which shows details about NVLink support. This validates that the CUDA environment is set up correctly and that NVLink is supported and used to connect the GPUs within the machine. If NVLink doesn’t appear in the deviceQuery output, then there is an issue with the hardware setup or the driver or software configuration.
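To script this final check, the program's overall result can be tested as well. The sketch below assumes deviceQuery prints a "Result = PASS" line when it succeeds. The function name device_query_passed is hypothetical.

```shell
# Succeed when deviceQuery output on stdin contains a passing result line.
# Assumption: the sample prints "Result = PASS" on success.
device_query_passed() {
    grep -q 'Result = PASS'
}

# Usage: runs only when the compiled binary is in the current directory.
if [ -x ./deviceQuery ]; then
    if ./deviceQuery | device_query_passed; then
        echo "deviceQuery passed"
    else
        echo "deviceQuery failed"
    fi
fi
```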