How to Disable NVLink on Machines

Machines are high-performance computing resources for scaling AI applications.


NVLink is required to accelerate your training workloads on A100-80Gx8 machines running Ubuntu 22.04. On any other machine running Ubuntu 22.04, such as H100x1 and A100-80Gx1 machines, you need to disable NVLink to ensure CUDA runs.

To disable NVLink, first disable it at the system level and then in the virtual machine’s (VM) RAM disk.

Disabling at the System Level

Back Up the GRUB Configuration File
  1. Open a Terminal window and navigate to the directory that contains your GRUB configuration file.
cd /etc/default/
  2. Create a backup of your GRUB configuration file using the following command, which adds the current date to the filename for future reference.
sudo cp grub grub.backup_$(date +%Y-%m-%d)

Open your GRUB configuration file for editing using the following command:

sudo nano /etc/default/grub

Update GRUB_CMDLINE_LINUX_DEFAULT to the following:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvlink.disable=1"
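If your GRUB_CMDLINE_LINUX_DEFAULT line already carries other parameters, you may prefer to append nvlink.disable=1 rather than overwrite the whole value. The following is a minimal sketch of that idea, not part of the official procedure. The helper name add_kernel_param is hypothetical, and the sketch assumes the value sits on a single double-quoted line.

```shell
#!/bin/sh
# Hypothetical helper: append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE,
# unless it is already listed there.
# Usage: add_kernel_param FILE PARAM
add_kernel_param() {
    file=$1
    param=$2
    # Skip the edit if the parameter is already on the line.
    # PARAM is used as a regular expression here, which is adequate for
    # simple names such as nvlink.disable=1.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote, writing through a
    # temporary file so the original keeps its ownership and permissions.
    sed -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file" > "$file.new" &&
        cat "$file.new" > "$file" &&
        rm -f "$file.new"
}
```

One cautious way to use it is to run add_kernel_param against a copy of /etc/default/grub in your home directory, review the result, and then copy it into place with sudo.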

Complete the update with the following command:

sudo update-grub

You must reboot your system for the changes to take effect.

sudo reboot
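
After the reboot, one way to confirm that the kernel received the parameter is to inspect /proc/cmdline, which holds the command line of the running kernel. The sketch below assumes a hypothetical helper named cmdline_has_param. It takes an optional file argument so that you can try it against a sample file first.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PARAM appears as a complete, space-separated
# entry on the kernel command line.
# Usage: cmdline_has_param PARAM [CMDLINE_FILE]
cmdline_has_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then match whole lines literally
    # so that nvlink.disable=1 does not match nvlink.disable=10.
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}
```

For example, running cmdline_has_param nvlink.disable=1 && echo "parameter is active" prints the message only when the parameter is present.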

Troubleshooting system after disabling NVLink

If your system does not boot, restore the GRUB configuration file from your backup and run update-grub again.

  1. Fetch a list of all the partitions on your system.
sudo fdisk -l

Locate the partition that has the Linux filesystem, which has a /dev/sdXn naming pattern. For example, if the device name is /dev/sda1, sda represents your hard disk and the number represents the partition number. You may have multiple partitions, so note the partition you need to mount for the next step.

  2. Mount the partition where your Ubuntu system is installed. This process varies depending on your system’s setup. Enter the following command, replacing /dev/sdXn with your system partition.
sudo mount /dev/sdXn /mnt
  3. Mount the necessary directories using the mount command:
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys
sudo mount --bind /run /mnt/run
  4. Change the root directory to your system’s partition using the chroot command.
sudo chroot /mnt
  5. Restore the GRUB configuration file using your backup file.
sudo cp $(ls -Art /etc/default/grub.backup* | tail -n 1) /etc/default/grub

The * in the command matches filenames that start with grub.backup, such as the filename with the backup’s creation date, in the /etc/default directory. The command copies the most recently modified match.

  6. Update your GRUB configuration file.
update-grub
  7. Exit from the chroot directory.
exit
  8. Reboot your system.
sudo reboot
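
The restore step above picks the most recently modified backup. Because the backup filenames end in a YYYY-MM-DD date, the newest backup can also be chosen by name, which does not depend on file modification times. The following is a minimal sketch under that assumption. The helper name latest_backup is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the backup in DIR whose name starts with PREFIX
# and sorts last. With a YYYY-MM-DD suffix, that is the newest date.
# Usage: latest_backup DIR PREFIX
latest_backup() {
    dir=$1
    prefix=$2
    latest=
    # The shell expands the pattern in sorted order, so the last match wins.
    for f in "$dir"/"$prefix"*; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        latest=$f
    done
    # Fail with a non-zero status if no backup was found.
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

For example, latest_backup /etc/default grub.backup_ prints the path of the newest dated backup, which you can then pass to cp as the source file.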

Disabling in the RAM Disk

You need to update your RAM disk to ensure NVLink is disabled at startup.

Back Up the RAM Disk
  1. Locate your current initrd file.
ls /boot/initrd.img-*
  2. Create a copy with the current date for identification.
sudo cp /boot/initrd.img-$(uname -r) /boot/initrd.img-$(uname -r).backup_$(date +%Y-%m-%d)
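
Since this backup is your only way back to a working RAM disk, it can be worth confirming that the copy is identical to the original. The sketch below combines the dated copy with a byte-for-byte comparison. The helper name backup_with_date is hypothetical, and the naming scheme simply mirrors the command above.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to SRC.backup_<today's date>, verify that the
# copy is identical, and print the name of the backup.
# Usage: backup_with_date SRC
backup_with_date() {
    src=$1
    dst="${src}.backup_$(date +%Y-%m-%d)"
    cp -p -- "$src" "$dst" || return 1
    # cmp -s is silent and returns non-zero if the two files differ.
    cmp -s -- "$src" "$dst" || return 1
    printf '%s\n' "$dst"
}
```

Files under /boot are owned by root, so a real run needs root privileges. You can try the helper on a test file in a scratch directory first.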

To disable NVLink, modify the initramfs configurations using one of the following options:

  • Create a denylist for NVIDIA NVLink Modules

    Create a file in /etc/modprobe.d/ where you denylist the NVIDIA NVLink modules. This does not change initramfs but makes initramfs respect your denylist.

    echo "blacklist <module_name>" | sudo tee /etc/modprobe.d/nvlink-denylist.conf
    
  • Create a custom script in /etc/initramfs-tools/ to disable NVLink

    Create a custom shell script, for example new-init/disable_nvlink.sh.

    sudo nano /etc/initramfs-tools/scripts/new-init/disable_nvlink.sh
    

Add commands that disable NVLink for your RAM disk. Here is an example disable_nvlink.sh script that disables specific NVIDIA kernel modules.

#!/bin/sh

modprobe -r nvidia_nvlink
modprobe -r nvidia_uvm

Ensure that the script is executable by running the following command:

sudo chmod +x /etc/initramfs-tools/scripts/new-init/disable_nvlink.sh
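
If you chose the denylist option instead of the script, note that the echo and tee command shown for that option writes a single line and overwrites the file each time it runs. To denylist several modules at once, a small generator can produce one blacklist line per module. This is a sketch only. The helper name denylist_lines is hypothetical, and the module names in the usage note are an assumption taken from the example script above, so substitute the modules that apply to your system.

```shell
#!/bin/sh
# Hypothetical helper: print one "blacklist" line for each module name given.
# Usage: denylist_lines MODULE...
denylist_lines() {
    for module in "$@"; do
        printf 'blacklist %s\n' "$module"
    done
}
```

For example, denylist_lines nvidia_nvlink nvidia_uvm | sudo tee /etc/modprobe.d/nvlink-denylist.conf writes both entries to the denylist file in one step.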

Rebuild initrd for your currently running kernel using the following command:

sudo update-initramfs -u

To ensure NVLink is disabled across all versions of the Linux kernel currently installed on your system, you may want to update all installed kernels. This includes multi-boot environments, systems with multiple kernel versions for testing or compatibility reasons, or maintaining consistency across all available boot options. To update all installed kernels, use the following command:

sudo update-initramfs -c -k all

Lastly, reboot your system.

sudo reboot
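
Once the system is back up, you can check whether the modules you targeted are still loaded. The kernel lists loaded modules in /proc/modules, one per line, with the module name in the first field. The sketch below assumes a hypothetical helper named module_loaded. It takes an optional file argument so that you can test it against sample data.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a module named NAME is listed as loaded.
# Usage: module_loaded NAME [MODULES_FILE]
module_loaded() {
    name=$1
    modules_file=${2:-/proc/modules}
    # Compare only the first field so that a module listed as a dependency
    # of another module (in a later field) does not count as a match.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

For example, module_loaded nvidia_uvm || echo "nvidia_uvm is not loaded" prints the message when the module from the example script is absent.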

Troubleshooting RAM disk after disabling NVLink

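If you need to use your backup, the following command restores your RAM disk from the backup file. Replace YYYY-MM-DD with the date in your backup's filename. Using $(date +%Y-%m-%d) here would only find a backup that was created on the same day you run the restore.

sudo cp /boot/initrd.img-$(uname -r).backup_YYYY-MM-DD /boot/initrd.img-$(uname -r)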

Then, update your GRUB configuration file.

sudo update-grub