How to Enable GPU Metrics with DCGM

DigitalOcean Droplets are Linux-based virtual machines (VMs) that run on top of virtualized hardware. Each Droplet you create is a new server you can use, either standalone or as part of a larger, cloud-based infrastructure.


To access and monitor process statistics, health, and diagnostics data for the NVIDIA GPUs in GPU Droplets, we recommend using the NVIDIA Data Center GPU Manager (DCGM) and DCGM Exporter.

DCGM is a collection of monitoring, configuration management, and health diagnostics tools for NVIDIA GPUs. DCGM Exporter exposes GPU metrics at an HTTP endpoint where monitoring software, like Prometheus, can consume it.
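
For example, once DCGM Exporter is running (set up later in this guide), a Prometheus server can scrape its default endpoint on port 9400. The following sketch writes a minimal scrape configuration; the file name and job name are illustrative assumptions, not something the exporter requires:

# Illustrative sketch: create a minimal Prometheus config that scrapes
# DCGM Exporter's default endpoint (localhost:9400).
cat <<'EOF' > prometheus.yml
scrape_configs:
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['localhost:9400']
EOF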

Install DCGM

Create a single or 8 GPU Droplet using our provided AI/ML-ready image, which has NVIDIA drivers and software preinstalled and configured.

Alternatively, if you need to use a different base image, you can manually install drivers and software. See NVIDIA’s official installation instructions for more detail on the prerequisite configuration.
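
If you start from a plain Ubuntu base image, note that the datacenter-gpu-manager package is distributed through NVIDIA’s CUDA network repository, so the repository needs to be configured before the install command below. A minimal sketch for Ubuntu 22.04 on x86_64 (adjust the repository path for other releases or architectures):

# Add NVIDIA's CUDA network repository keyring, then refresh the package index.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update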

Then, on your GPU Droplet, install DCGM version 3.3.8 or later:

sudo apt-get install -y datacenter-gpu-manager

Next, restart systemd-journal to make sure DCGM logs are available:

sudo systemctl restart systemd-journald

8 GPU Droplets Only: Install NSCQ

On 8 GPU Droplets only, you additionally need to install the NVIDIA Switch Configuration and Query (NSCQ) library.

First, get the driver version branch of the NVIDIA drivers installed on the Droplet:

dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-driver-*' | grep -v '^nvidia-driver-common' | head -n1

The output shows the installed driver package and its version (for example, nvidia-driver-535 535.183.01-0ubuntu1). The driver version branch is the number in the package name (535 in this example). Use it to install the matching NSCQ library package, substituting your Droplet’s driver version branch into the package name (for example, libnvidia-nscq-535):

sudo apt-get install -y libnvidia-nscq-DRIVER_VERSION_BRANCH
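
For example, with drivers from the 535 branch installed (as in the example above), the command is:

sudo apt-get install -y libnvidia-nscq-535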

8 GPU Droplets also require NVIDIA Fabric Manager. Fabric Manager is already configured in the AI/ML-ready image, but if you’re using a different base image, you need to install Fabric Manager manually.
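
If you do need to install it manually, a minimal sketch is to install the Fabric Manager package matching your driver version branch (535 assumed here, as above) and enable its service:

# Assumes the 535 driver branch; substitute your Droplet's branch.
sudo apt-get install -y nvidia-fabricmanager-535
sudo systemctl --now enable nvidia-fabricmanager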

Enable and Verify DCGM

As explained in NVIDIA’s documentation, you can run the core DCGM library in embedded mode (where the agent is loaded as a shared library) or in standalone mode (where the agent is embedded in a daemon). We recommend running in standalone mode because of its flexibility and reduced maintenance.

To run DCGM in standalone mode, you need to configure it to execute on startup. Enable the DCGM system service with sudo privileges and start it immediately:

sudo systemctl --now enable nvidia-dcgm

Elevated privileges are necessary because features like configuration settings and diagnostics do not work without privileged access to the GPU.

To verify that DCGM is running correctly, check the status of the DCGM service:

sudo service nvidia-dcgm status

If the service is installed and running, the nvidia-dcgm service is listed as active (running) in the output:

● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor p>
     Active: active (running) since Mon 2024-09-23 23:48:27 UTC; 17s ago
   Main PID: 1793 (nv-hostengine)
      Tasks: 8 (limit: 289792)
     Memory: 20.8M
        CPU: 85ms
     CGroup: /system.slice/nvidia-dcgm.service
             └─1793 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 systemd[1]: Started NVIDIA DCGM service.
Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 nv-hostengine[1793]: DCGM initialized
Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 nv-hostengine[1793]: Started host engine version 3.3.8 using port number: 5555

On 8 GPU Droplets, you can similarly verify that Fabric Manager is running:

sudo service nvidia-fabricmanager status

Finally, verify that DCGM can find the GPU devices:

dcgmi discovery --list

The output for single GPU Droplets shows 1 GPU and no NvSwitches:

1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-927d7444-b45a-7c49-8402-5e686e46a026                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+

The output for 8 GPU Droplets shows 8 GPUs and 4 NvSwitches.

Run DCGM Exporter

With DCGM installed and configured, you can now run DCGM Exporter to expose metrics data.

For simplicity, we recommend running it in a Docker container, but you can also deploy it as a DaemonSet on GPU nodes in a Kubernetes cluster using the NVIDIA GPU Operator.
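
If you choose the Kubernetes route, NVIDIA’s GPU Operator Helm chart is the usual entry point; a rough sketch is shown below (see NVIDIA’s GPU Operator documentation for current chart options). The rest of this guide follows the Docker approach.

# Sketch: install the GPU Operator, which deploys DCGM Exporter on GPU nodes.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator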

First, install Docker:

sudo apt-get install -y docker.io

Next, configure Docker to use the NVIDIA container runtime:

sudo nvidia-ctk runtime configure --runtime=docker
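
Optionally, confirm the change before restarting Docker. The command above registers an nvidia runtime in Docker’s daemon configuration:

# Expect an "nvidia" entry under "runtimes" that points at nvidia-container-runtime.
cat /etc/docker/daemon.json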

Restart Docker to apply the changes:

sudo systemctl restart docker

By default, DCGM Exporter starts nv-hostengine as an embedded process, but our recommended setup runs nv-hostengine as a standalone process. The following command runs DCGM Exporter in a Docker container and uses the -r (--remote-hostengine-info) flag to connect to the existing nv-hostengine process:

DCGM_EXPORTER_VERSION=3.3.8-3.6.0 &&
docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu22.04 \
   -r localhost:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv

Make sure the DCGM version in the image tag (3.3.8 in 3.3.8-3.6.0 above) matches the installed version of datacenter-gpu-manager. You can check the installed version in the output of sudo service nvidia-dcgm status.
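
You can also check the installed package version directly:

dpkg-query -W datacenter-gpu-manager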

You can validate that DCGM Exporter is working by querying the exposed metrics endpoint:

curl localhost:9400/metrics

The default output resembles the following:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-927d7444-b45a-7c49-8402-5e686e46a026",pci_bus_id="00000000:00:09.0",device="nvidia0",modelName="NVIDIA H100 80GB HBM3",Hostname="6d6d7dc3ce21",DCGM_FI_DRIVER_VERSION="535.183.01"} 345
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-927d7444-b45a-7c49-8402-5e686e46a026",pci_bus_id="00000000:00:09.0",device="nvidia0",modelName="NVIDIA H100 80GB HBM3",Hostname="6d6d7dc3ce21",DCGM_FI_DRIVER_VERSION="535.183.01"} 2619
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
[...]

You can modify the counters file that DCGM Exporter reads with the -f flag (in the command above, /etc/dcgm-exporter/dcp-metrics-included.csv) to define which GPU metrics you would like to export. See NVIDIA’s documentation on changing metrics for more information.
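
For example, one approach is to keep an edited copy of the counters file on the host and mount it into the container, then point DCGM Exporter at it with -f. The file name custom-counters.csv below is illustrative:

# Each line of the counters file is: DCGM field, Prometheus metric type, help text.
# For example: DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   -v "$PWD/custom-counters.csv:/etc/dcgm-exporter/custom-counters.csv" \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu22.04 \
   -r localhost:5555 -f /etc/dcgm-exporter/custom-counters.csv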

Additional Resources

Official instructions from NVIDIA for installing and configuring DCGM (docs.nvidia.com).
NVIDIA’s detailed instructions on running and customizing DCGM Exporter, integrating GPU telemetry into Kubernetes, and setting up Prometheus (docs.nvidia.com).
The DCGM-Exporter repository README has quickstart information for standalone and Kubernetes configurations, building from source, customizing output, and more (github.com).