How do I fix a "system not initialized" error on multi-GPU Droplets?
Validated on 8 Apr 2025 • Last edited on 17 Apr 2025
NVIDIA Fabric Manager configures the intra-node GPU fabric that allows NVSwitch-connected GPUs to communicate with each other. If Fabric Manager is not running on a GPU Droplet that has multiple NVSwitch-connected GPUs, like our H100x8 offering, then multi-GPU workloads fail to start.
If you receive one of the following “system not yet initialized” errors when launching a GPU workload on a GPU Droplet, it typically indicates a problem with NVIDIA Fabric Manager:
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ System not yet initialized (error 802) ]]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
[2025-03-31 12:11:36] test-8x-gpu:1815005:1815050 [1] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'
[2025-03-31 12:11:36] test-8x-gpu:1815005:1815052 [3] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'
[2025-03-31 12:11:36] test-8x-gpu:1815005:1815051 [2] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'
[2025-03-31 12:11:36] test-8x-gpu:1815005:1815055 [6] transport/nvls.cc:254 NCCL WARN Cuda failure 802 'system not yet initialized'
To confirm, check the fabric state and status reported for each GPU:
nvidia-smi -q | grep -A 2 Fabric
The output displays the status for all GPUs:
--
Fabric
State : In Progress
Status : N/A
--
Fabric
State : In Progress
Status : N/A
--
Fabric
State : In Progress
Status : N/A
--
[...]
If any of the 8 GPUs reports a State other than “Completed” or “Success”, Fabric Manager has not initialized the system correctly.
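As a quick scripted check, you can count how many GPUs report a completed fabric state. This is a minimal sketch that assumes an 8-GPU Droplet and the GNU grep shipped with Ubuntu:
nvidia-smi -q | grep -A 2 Fabric | grep -c -E 'State +: (Completed|Success)'
If the command prints a number lower than 8, the fabric is not fully initialized.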
To fix this error, try the following debugging steps in order:
Start Fabric Manager
Check that Fabric Manager is running:
systemctl status nvidia-fabricmanager
In the output, look for the Active line:
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2025-03-20 15:32:05 UTC; 1 week 3 days ago
Main PID: 2941 (nv-fabricmanage)
Tasks: 17 (limit: 629145)
Memory: 9.7M
CPU: 3min 56.415s
CGroup: /system.slice/nvidia-fabricmanager.service
└─2941 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
If Fabric Manager isn’t running, start it:
systemctl start nvidia-fabricmanager
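The start command itself prints no output. To confirm the service started cleanly, you can view its logs, for example with journalctl:
journalctl -u nvidia-fabricmanager --no-pager | tail
A successful start looks similar to the following: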
Mar 20 15:32:01 test-8x-gpu systemd[1]: Starting NVIDIA fabric manager service...
Mar 20 15:32:05 test-8x-gpu nv-fabricmanager[2941]: Connected to 1 node.
Mar 20 15:32:05 test-8x-gpu nv-fabricmanager[2941]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPU>
Mar 20 15:32:05 test-8x-gpu systemd[1]: Started NVIDIA fabric manager service.
Once it’s running, try the workload again.
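Before launching the full workload, you can optionally verify that CUDA now initializes. For example, a quick check with PyTorch (assuming it is installed, as in the error output above):
python3 -c "import torch; torch.ones(1).cuda(); print(torch.cuda.device_count())"
On an H100x8 Droplet with a healthy fabric, this prints 8; if the fabric is still not initialized, it fails with the same error 802.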
Reset the GPUs
If Fabric Manager is running but you get the same error, try resetting the GPUs:
nvidia-smi -r
GPU 00000000:01:01.0 was successfully reset.
GPU 00000000:02:01.0 was successfully reset.
GPU 00000000:03:01.0 was successfully reset.
GPU 00000000:04:01.0 was successfully reset.
GPU 00000000:05:01.0 was successfully reset.
GPU 00000000:06:01.0 was successfully reset.
GPU 00000000:07:01.0 was successfully reset.
GPU 00000000:08:01.0 was successfully reset.
Note: The operation has successfully reset all GPUs and NVSwitches. If the services, such as nvidia-fabricmanager, which manage or monitor NVSwitches are running, they might have been affected by this operation. Please refer respective service status or logs for details.
All done.
Then restart Fabric Manager:
systemctl restart nvidia-fabricmanager
Finally, reboot the Droplet and try the workload again.
Align Fabric Manager and Driver Versions
If you still receive errors after rebooting, check that Fabric Manager is running after the reboot:
systemctl status nvidia-fabricmanager
In this output, the Fabric Manager service failed to start:
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2025-04-02 15:45:55 UTC; 1min 48s ago
Process: 11711 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
CPU: 4ms
Often, this happens because of a mismatch between the Fabric Manager version and the GPU driver version. A mismatch commonly occurs when installed packages are updated without rebooting the Droplet or reloading the kernel module.
To fix mismatched versions, first try updating all installed packages to their latest versions:
apt-get update
apt-get upgrade
Then, reboot the Droplet. If the service still isn’t running successfully, you can manually align the versions of Fabric Manager and the GPU drivers.
Look at the output of the Fabric Manager logs:
tail /var/log/fabricmanager.log
Look for a line that specifies the versions. The exact version numbers vary depending on your configuration.
In this example, the driver is on version 535.216.01 and Fabric Manager is on version 535.230.02:
[Apr 02 2025 15:45:55] [ERROR] [tid 11713] fabric manager NVIDIA GPU driver interface version 535.230.02 don't match with driver version 535.216.01. Please update with matching NVIDIA driver package.
Fabric Manager is ahead of the driver version, so one fix is to downgrade the version of Fabric Manager to match the driver.
To find the right package version, query the Fabric Manager package for your driver’s major release (535 in this example) and filter the output by the full driver version:
apt-cache madison nvidia-fabricmanager-535 | grep 535.216.01
In the output, look for the matching Fabric Manager version, which is often (but not always) the same version with -1 appended:
nvidia-fabricmanager-535 | 535.216.01-1 | https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Continuing with this example, you can install this specific version of Fabric Manager by specifying the version:
apt-get install nvidia-fabricmanager-535=535.216.01-1
[...]
The following packages will be DOWNGRADED:
nvidia-fabricmanager-535
[...]
dpkg: warning: downgrading nvidia-fabricmanager-535 from 535.230.02-1 to 535.216.01-1
(Reading database ... 103247 files and directories currently installed.)
Preparing to unpack .../nvidia-fabricmanager-535_535.216.01-1_amd64.deb ...
Unpacking nvidia-fabricmanager-535 (535.216.01-1) over (535.230.02-1) ...
[...]
When the installation is complete, reset the GPUs and restart Fabric Manager as before:
nvidia-smi -r
systemctl restart nvidia-fabricmanager
Finally, try your workload again.
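To double-check that the versions now match, you can compare the loaded driver version against the installed Fabric Manager package, for example:
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
dpkg -l | grep nvidia-fabricmanager
Both should report the same driver version (the package version carries an extra -1 packaging suffix), 535.216.01 in this example.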
Prevention
You can prevent version mismatches between Fabric Manager and the driver by always updating them together and then either rebooting the Droplet or reloading the driver and restarting the Fabric Manager service.
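For example, a minimal update routine might upgrade both packages in one step and then reboot. This sketch assumes the 535 driver branch used above and a driver installed from the nvidia-driver-535 package; adjust the package names to match your system:
apt-get update
apt-get install --only-upgrade nvidia-driver-535 nvidia-fabricmanager-535
reboot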
The unattended-upgrades process, which is enabled by default, can update packages in the background, which can sometimes lead to this issue. To disable it, first put a manual process in place to keep the system current with security fixes, then modify /etc/apt/apt.conf.d/20auto-upgrades to contain the following:
/etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
This prevents background updates and leaves the system using the existing versions of both the driver and Fabric Manager until you update them.
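You can verify that the setting is active, for example by dumping the APT configuration:
apt-config dump | grep Unattended-Upgrade
The output should include APT::Periodic::Unattended-Upgrade "0";.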
Additional Support
If you still have errors after following these steps, open a support ticket and include the saved text output of the following commands:
systemctl status nvidia-fabricmanager
cat /var/log/fabricmanager.log
nvidia-smi -q
dpkg -l
The output of these commands is necessary for us to diagnose the problem, so including it up front helps us resolve your ticket quickly.
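To gather everything in one step, you can redirect the output of all four commands into a single file and attach it to the ticket. The filename support-info.txt is just an example:
{
  systemctl status nvidia-fabricmanager
  cat /var/log/fabricmanager.log
  nvidia-smi -q
  dpkg -l
} > support-info.txt 2>&1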