How do I fix a "system not initialized" error on multi-GPU Droplets?

Validated on 8 Apr 2025 • Last edited on 17 Apr 2025

NVIDIA Fabric Manager configures the intra-node GPU fabric that allows NVSwitch-connected GPUs to communicate with each other. If Fabric Manager is not running on a GPU Droplet that has multiple NVSwitch-connected GPUs, like our H100x8 offering, then multi-GPU workloads fail to start.

If you receive one of the following “system not yet initialized” errors when launching a GPU workload on a GPU Droplet, it typically indicates a problem with NVIDIA Fabric Manager:

Example error
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ System not yet initialized (error 802) ]]

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Example error
[2025-03-31 12:11:36] test-8x-gpu:1815005:1815050 [1] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'

[2025-03-31 12:11:36] test-8x-gpu:1815005:1815052 [3] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'

[2025-03-31 12:11:36] test-8x-gpu:1815005:1815051 [2] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'

[2025-03-31 12:11:36] test-8x-gpu:1815005:1815055 [6] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'

To confirm, check the state and status of Fabric Manager:

nvidia-smi -q | grep -A 2 Fabric

The output shows the fabric state and status for each GPU:

Output
--
    Fabric
        State                             : In Progress
        Status                            : N/A
--
    Fabric
        State                             : In Progress
        Status                            : N/A
--
    Fabric
        State                             : In Progress
        Status                            : N/A
--
[...]

On a correctly initialized system, every GPU reports a State of Completed and a Status of Success. If any GPU instead shows In Progress or N/A, Fabric Manager has not initialized the system correctly.
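
As a quick check, you can count how many GPUs report a completed fabric state. This is a minimal sketch based on the same nvidia-smi output as above; on an H100x8 Droplet, both counts should be 8 when the fabric is healthy:

nvidia-smi -q | grep -A 2 Fabric | grep -c "State.*Completed"
nvidia-smi -q | grep -A 2 Fabric | grep -c "Status.*Success"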

To fix this error, try the following debugging steps in order:

  1. Start Fabric Manager.
  2. Reset the GPUs.
  3. Align the Fabric Manager and driver versions.

Start Fabric Manager

Check that Fabric Manager is running:

systemctl status nvidia-fabricmanager

In the output, look for the Active line:

Output
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2025-03-20 15:32:05 UTC; 1 week 3 days ago
   Main PID: 2941 (nv-fabricmanage)
      Tasks: 17 (limit: 629145)
     Memory: 9.7M
        CPU: 3min 56.415s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2941 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

If Fabric Manager isn’t running, start it:

systemctl start nvidia-fabricmanager

The command itself prints nothing on success. You can confirm the startup in the service logs, for example with journalctl -u nvidia-fabricmanager:

Output
Mar 20 15:32:01 test-8x-gpu systemd[1]: Starting NVIDIA fabric manager service...
Mar 20 15:32:05 test-8x-gpu nv-fabricmanager[2941]: Connected to 1 node.
Mar 20 15:32:05 test-8x-gpu nv-fabricmanager[2941]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPU>
Mar 20 15:32:05 test-8x-gpu systemd[1]: Started NVIDIA fabric manager service.

Once it’s running, try the workload again.
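
If the Loaded line in the status output shows the service as disabled rather than enabled, you can also enable it so it starts automatically on future boots:

systemctl enable nvidia-fabricmanager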

Reset the GPUs

If Fabric Manager is running but you get the same error, try resetting the GPUs:

nvidia-smi -r
Output
GPU 00000000:01:01.0 was successfully reset.
GPU 00000000:02:01.0 was successfully reset.
GPU 00000000:03:01.0 was successfully reset.
GPU 00000000:04:01.0 was successfully reset.
GPU 00000000:05:01.0 was successfully reset.
GPU 00000000:06:01.0 was successfully reset.
GPU 00000000:07:01.0 was successfully reset.
GPU 00000000:08:01.0 was successfully reset.

Note: The operation has successfully reset all GPUs and NVSwitches. If the services, such as nvidia-fabricmanager, which manage or monitor NVSwitches are running, they might have been affected by this operation. Please refer respective service status or logs for details.
All done.

Then restart Fabric Manager:

systemctl restart nvidia-fabricmanager

Finally, reboot the Droplet and try the workload again.
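
After the Droplet comes back up, re-run the fabric check from earlier to confirm that every GPU now reports a completed state before retrying the workload:

nvidia-smi -q | grep -A 2 Fabric

If your workload uses PyTorch, as in the example errors above, a quick one-liner can also confirm that CUDA now initializes (assuming PyTorch is installed in your environment):

python3 -c "import torch; print(torch.cuda.device_count())"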

Align Fabric Manager and Driver Versions

If you still receive errors after the reboot, check whether Fabric Manager started successfully:

systemctl status nvidia-fabricmanager

In this output, the Fabric Manager service failed to start:

Output
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2025-04-02 15:45:55 UTC; 1min 48s ago
    Process: 11711 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 4ms

Often, this happens because of a mismatch between the Fabric Manager version and the GPU driver version. Mismatched versions commonly occur after updating installed packages without rebooting the Droplet or reloading the kernel module.
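
A quick way to compare the installed versions is to list the relevant packages. This is a sketch; the exact package names depend on how the driver was installed:

dpkg -l | grep -E "nvidia-(driver|fabricmanager)"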

To fix mismatched versions, first try updating all installed packages to their latest versions:

apt-get update
apt-get upgrade

Then, reboot the Droplet. If the service still isn’t running successfully, you can manually align the versions of Fabric Manager and the GPU drivers.

Check the end of the Fabric Manager log:

tail /var/log/fabricmanager.log

Look for a line that specifies the versions. The exact version numbers vary depending on your configuration.

In this example, the driver is on version 535.216.01 and Fabric Manager is on version 535.230.02:

Output
[Apr 02 2025 15:45:55] [ERROR] [tid 11713] fabric manager NVIDIA GPU driver interface version 535.230.02 don't match with driver version 535.216.01. Please update with matching NVIDIA driver package.

Fabric Manager is ahead of the driver version, so one fix is to downgrade the version of Fabric Manager to match the driver.
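
You can also confirm the currently installed driver version directly with nvidia-smi:

nvidia-smi --query-gpu=driver_version --format=csv,noheader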

To find the right package version, query the package for the major release branch (535 in this example) and filter the output by the full driver version:

apt-cache madison nvidia-fabricmanager-535 | grep 535.216.01

In the output, look for the matching Fabric Manager version, which is often (but not always) the same version with -1 appended:

Output
nvidia-fabricmanager-535 | 535.216.01-1 | https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Continuing with this example, install the matching version of Fabric Manager by specifying it explicitly:

apt-get install nvidia-fabricmanager-535=535.216.01-1
Output
[...]
The following packages will be DOWNGRADED:
  nvidia-fabricmanager-535
[...]
dpkg: warning: downgrading nvidia-fabricmanager-535 from 535.230.02-1 to 535.216.01-1
(Reading database ... 103247 files and directories currently installed.)
Preparing to unpack .../nvidia-fabricmanager-535_535.216.01-1_amd64.deb ...
Unpacking nvidia-fabricmanager-535 (535.216.01-1) over (535.230.02-1) ...
[...]

When the installation is complete, reset the GPUs and restart Fabric Manager as before:

nvidia-smi -r
systemctl restart nvidia-fabricmanager

Finally, try your workload again.
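
If the downgrade resolves the issue, you can optionally hold the package so a background upgrade doesn't reintroduce the mismatch before you are ready to update the driver and Fabric Manager together:

apt-mark hold nvidia-fabricmanager-535

Run apt-mark unhold nvidia-fabricmanager-535 later to allow upgrades again.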

Prevention

You can prevent version mismatches between Fabric Manager and the driver by always updating them together and then either rebooting the Droplet or reloading the driver and restarting the Fabric Manager service.
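
For example, one possible manual update flow looks like the following. This is a sketch; the -535 package names are illustrative and depend on the driver branch installed on your Droplet:

apt-get update
apt-get install --only-upgrade nvidia-driver-535 nvidia-fabricmanager-535
reboot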

The unattended-upgrades process, which is enabled by default, can update packages in the background, which sometimes leads to this mismatch. To disable it, first put a manual process in place to keep the system current with security fixes, then modify /etc/apt/apt.conf.d/20auto-upgrades to contain the following:

/etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";

This prevents background updates and leaves the system using the existing versions of both the driver and Fabric Manager until you update them.
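
You can verify that the setting took effect by dumping the effective APT configuration:

apt-config dump | grep Unattended-Upgrade

The output should show APT::Periodic::Unattended-Upgrade set to "0".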

Additional Support

If you still have errors after following these steps, open a support ticket and include the saved text output of the following commands:

  • systemctl status nvidia-fabricmanager
  • cat /var/log/fabricmanager.log
  • nvidia-smi -q
  • dpkg -l

We need the output of these commands to diagnose the problem, so including it up front helps us resolve your ticket quickly.
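
One way to capture everything is to redirect each command's output to a file (the file names here are just examples) and attach the files to your ticket:

systemctl status nvidia-fabricmanager --no-pager -l > fm-status.txt
cat /var/log/fabricmanager.log > fm-log.txt
nvidia-smi -q > nvidia-smi-q.txt
dpkg -l > dpkg-l.txt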
