# How to Enable GPU Metrics on DigitalOcean Gradient™ AI GPU Droplets with DCGM

DigitalOcean Droplets are Linux-based virtual machines (VMs) that run on top of virtualized hardware. Each Droplet you create is a new server you can use, either standalone or as part of a larger, cloud-based infrastructure.

To access and monitor process statistics, health, and diagnostics data for the NVIDIA GPUs in GPU Droplets, we recommend using the [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) and [DCGM Exporter](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html).

If you only need GPU metrics in DigitalOcean Insights, enable [GPU Observability](https://docs.digitalocean.com/products/monitoring/details/features/index.html.md#gpu-observability) during Droplet creation by selecting **Improved Metrics and Monitoring** on an AI/ML-Ready Image. `do-agent` detects the GPU, scrapes the local exporter securely (bound to `127.0.0.1`), and forwards metrics to Insights. For metric definitions, see [Monitoring Metrics](https://docs.digitalocean.com/products/monitoring/concepts/metrics/index.html.md).

To integrate with external systems or customize dashboards beyond the **Insights** tab, install a standalone exporter on the Droplet by either [using DCGM exporter for NVIDIA GPUs or `device-metrics-exporter` for AMD GPUs](https://docs.digitalocean.com/products/droplets/how-to/gpu/enable-metrics/index.html.md).

## Install DCGM

Create a single GPU or 8 GPU Droplet using our provided AI/ML-Ready image, which has NVIDIA drivers and software preinstalled and configured.

Alternatively, if you need to use a different base image, you can [manually install drivers and software](https://docs.digitalocean.com/products/droplets/getting-started/recommended-gpu-setup/index.html.md). See [NVIDIA’s official installation instructions](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#id1) for more detail on the prerequisite configuration.

Then, on your GPU Droplet, install DCGM version 3.3.8 or later:

```bash
sudo apt-get install -y datacenter-gpu-manager
```

Next, restart `systemd-journald` to make sure DCGM logs are available:

```bash
sudo systemctl restart systemd-journald
```

### 8 GPU Droplets Only: Install NSCQ

On 8 GPU Droplets only, you additionally need to install the [NVIDIA Switch Configuration and Query (NSCQ) library](https://docs.nvidia.com/datacenter/tesla/hgx-software-guide/index.html#nscq).

First, get the driver version branch of the NVIDIA drivers installed on the Droplet:

```bash
dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-driver-*' | grep -v '^nvidia-driver-common' | awk 'NF == 2 {print $2}' | head -n1
```

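The output shows the full driver version (for example, `535.183.01-0ubuntu1`). The driver version branch is the leading number of that version (`535` in this example). Install the NSCQ library package whose name ends in the same branch by replacing `DRIVER_VERSION_BRANCH` in the following command. To pin an exact package version, append it to the package name (for example, `libnvidia-nscq-535=535.183.01-1`).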

```bash
sudo apt-get install -y libnvidia-nscq-DRIVER_VERSION_BRANCH
```
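Because the branch is the part of the driver version before the first dot, you can derive the package name instead of typing it by hand. The following is a minimal sketch that assumes the version format shown above. `driver_branch` and `nscq_package` are hypothetical helper names, not part of any NVIDIA tooling.

```bash
# Hypothetical helpers: derive the NSCQ package name from a driver version string.

# Print the driver branch, which is everything before the first dot.
# Example input: 535.183.01-0ubuntu1
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Print the NSCQ package name for the given driver version.
nscq_package() {
    printf 'libnvidia-nscq-%s\n' "$(driver_branch "$1")"
}

nscq_package "535.183.01-0ubuntu1"   # prints libnvidia-nscq-535
```

You could then pass the result to `sudo apt-get install -y`, ideally after confirming with `apt-cache policy` that the package exists for your branch.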

8 GPU Droplets also require [NVIDIA Fabric Manager](https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html). Fabric Manager is already configured in the AI/ML-Ready image, but if you’re using a different base image, you need to [install Fabric Manager manually](https://docs.digitalocean.com/products/droplets/getting-started/recommended-gpu-setup/index.html.md).

## Enable and Verify DCGM

As explained in [NVIDIA’s documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html#modes-of-operation), you can run the core DCGM library in embedded mode (where the agent is loaded as a shared library) or in standalone mode (where the agent is embedded in a daemon). We recommend running in standalone mode because of its flexibility and reduced maintenance.

To run DCGM in standalone mode, you need to configure it to execute on startup. Enable the DCGM system service with `sudo` privileges and start it immediately:

```bash
sudo systemctl --now enable nvidia-dcgm
```

Elevated privileges are necessary because features like configuration settings and diagnostics do not work without privileged access to the GPU.

To verify that DCGM is running correctly, check the status of the DCGM service:

```bash
sudo service nvidia-dcgm status
```

If the service is installed and running, the `nvidia-dcgm` service is listed as `active (running)` in the output:

```text
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor p>
     Active: active (running) since Mon 2024-09-23 23:48:27 UTC; 17s ago
   Main PID: 1793 (nv-hostengine)
      Tasks: 8 (limit: 289792)
     Memory: 20.8M
        CPU: 85ms
     CGroup: /system.slice/nvidia-dcgm.service
             └─1793 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 systemd[1]: Started NVIDIA DCGM service.
Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 nv-hostengine[1793]: DCGM initialized
Sep 23 23:48:27 ml-ai-ubuntu-gpu-h100x1-80gb-tor1 nv-hostengine[1793]: Started host engine version 3.3.8 using port number: 5555
```

On 8 GPU systems, you can similarly verify that Fabric Manager is running with `sudo service nvidia-fabricmanager status`.
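If you want to script these checks, for example in a provisioning step, `systemctl is-active --quiet` returns a zero exit status only when a unit is running. The following is a minimal sketch. `check_services` is a hypothetical helper name, and the service names are the ones used in this guide.

```bash
# Hypothetical helper: report whether each named systemd service is active.
# Returns non-zero if at least one service is not running.
check_services() {
    status=0
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc"; then
            echo "ok: $svc"
        else
            echo "not running: $svc" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage on a single GPU Droplet:
#   check_services nvidia-dcgm
# Usage on an 8 GPU Droplet:
#   check_services nvidia-dcgm nvidia-fabricmanager
```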

Finally, verify that DCGM can find the GPU devices:

```bash
dcgmi discovery --list
```

The output for single GPU Droplets shows 1 GPU and no NvSwitches:

```text
1 GPU found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA H100 80GB HBM3                                          |
|        | PCI Bus ID: 00000000:00:09.0                                         |
|        | Device UUID: GPU-927d7444-b45a-7c49-8402-5e686e46a026                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
```

The output for 8 GPU Droplets shows 8 GPUs and 4 NvSwitches.
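To check the device counts automatically, you can parse the summary lines of the discovery output. The following is a minimal sketch that assumes the summary lines follow the `N GPU found.` and `N NvSwitches found.` pattern shown above, and it also accepts the plural and singular variants. `discovery_counts` is a hypothetical helper name.

```bash
# Hypothetical helper: read `dcgmi discovery --list` output on stdin and
# print "<gpus> <nvswitches>" based on the summary lines.
discovery_counts() {
    awk '
        /^[0-9]+ GPUs? found\./         { gpus = $1 }
        /^[0-9]+ NvSwitch(es)? found\./ { switches = $1 }
        END { print gpus + 0, switches + 0 }
    '
}

# Usage:
#   dcgmi discovery --list | discovery_counts
```

For the single GPU output above, the helper prints `1 0`. Based on the counts stated for 8 GPU Droplets, you would expect `8 4` there.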

## Run DCGM Exporter

With DCGM installed and configured, you can now run DCGM Exporter to expose metrics data.

For simplicity, we recommend running it in a Docker container, but you can also deploy it as a [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) on GPU nodes in a Kubernetes cluster using [the NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html).

First, install Docker:

```bash
sudo apt-get install -y docker.io
```

Next, configure Docker to use the NVIDIA container runtime:

```bash
sudo nvidia-ctk runtime configure --runtime=docker
```

Restart Docker to apply the changes:

```bash
sudo systemctl restart docker
```

By default, DCGM Exporter starts `nv-hostengine` as an embedded process, but our recommended setup runs `nv-hostengine` as a standalone process. The following command runs DCGM Exporter in a Docker container and uses the `-r` (`--remote-hostengine-info`) flag to connect to the existing process:

```shell
DCGM_EXPORTER_VERSION=3.3.8-3.6.0 &&
sudo docker run -d --rm \
   --gpus all \
   --net host \
   --cap-add SYS_ADMIN \
   nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu22.04 \
   -r localhost:5555 -f /etc/dcgm-exporter/dcp-metrics-included.csv
```

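The exporter image tag has the form `DCGM_VERSION-EXPORTER_VERSION`, so the leading `3.3.8` in `3.3.8-3.6.0` is the DCGM version. Make sure it matches the installed version of `datacenter-gpu-manager`, which appears in the `Started host engine version` line in the output of `sudo service nvidia-dcgm status`.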
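You can script this comparison instead of reading the versions by eye. The following is a minimal sketch that assumes the exporter tag places the DCGM version before the first hyphen and that the status output contains a `Started host engine version` line like the one shown earlier. All three function names are hypothetical helpers.

```bash
# Hypothetical helpers for comparing DCGM versions.

# Print the DCGM part of an exporter tag such as 3.3.8-3.6.0
# (the text before the first hyphen).
exporter_dcgm_version() {
    printf '%s\n' "${1%%-*}"
}

# Read service status or journal text on stdin and print the
# host engine version from the most recent matching line.
hostengine_version() {
    sed -n 's/.*Started host engine version \([0-9][0-9.]*\).*/\1/p' | tail -n 1
}

# Succeed only when the exporter tag's DCGM version equals the given version.
versions_match() {
    [ "$(exporter_dcgm_version "$1")" = "$2" ]
}

# Usage:
#   installed=$(sudo service nvidia-dcgm status | hostengine_version)
#   versions_match "3.3.8-3.6.0" "$installed" && echo "versions match"
```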

You can validate that DCGM Exporter is working by querying the exposed metrics endpoint:

```bash
curl localhost:9400/metrics
```

The default output resembles the following:

```text
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-927d7444-b45a-7c49-8402-5e686e46a026",pci_bus_id="00000000:00:09.0",device="nvidia0",modelName="NVIDIA H100 80GB HBM3",Hostname="6d6d7dc3ce21",DCGM_FI_DRIVER_VERSION="535.183.01"} 345
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-927d7444-b45a-7c49-8402-5e686e46a026",pci_bus_id="00000000:00:09.0",device="nvidia0",modelName="NVIDIA H100 80GB HBM3",Hostname="6d6d7dc3ce21",DCGM_FI_DRIVER_VERSION="535.183.01"} 2619
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
[...]
```
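Each sample line ends with the metric value, so a short filter can pull out a single metric for a quick check. The following is a minimal sketch that assumes the label layout shown above, including a `gpu` label. `metric_samples` is a hypothetical helper name.

```bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print "<gpu> <value>" for every sample of the named metric.
metric_samples() {
    awk -v name="$1" '
        # Match only sample lines, which start with the metric name and "{".
        index($0, name "{") == 1 {
            gpu = "?"
            if (match($0, /gpu="[^"]*"/)) {
                # Strip the leading gpu=" (5 characters) and the closing quote.
                gpu = substr($0, RSTART + 5, RLENGTH - 6)
            }
            # The value is the last whitespace-separated field on the line.
            print gpu, $NF
        }
    '
}

# Usage:
#   curl -s localhost:9400/metrics | metric_samples DCGM_FI_DEV_SM_CLOCK
```

For the example output above, the helper prints `0 345` for `DCGM_FI_DEV_SM_CLOCK`.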

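The preceding `docker run` command uses `-f` to select the metrics file `/etc/dcgm-exporter/dcp-metrics-included.csv` inside the container. Without `-f`, DCGM Exporter reads its default file, `/etc/dcgm-exporter/default-counters.csv`. To define which GPU metrics you would like to export, supply your own CSV file with `-f`. See [NVIDIA’s documentation on changing metrics](https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#changing-metrics) for more information.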
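A metrics file lists one field per line as a DCGM field name, a Prometheus metric type, and a help message. The following is a minimal sketch that writes a small file using the three fields from the example output above. The file name `custom-counters.csv` and the `COUNTERS_FILE` variable are hypothetical choices for illustration.

```bash
# Write a small metrics file. The path is a hypothetical example.
COUNTERS_FILE="${COUNTERS_FILE:-./custom-counters.csv}"

cat > "$COUNTERS_FILE" <<'EOF'
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK,    gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK,   gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
EOF
```

To use a file like this with the container, you would typically bind-mount it with Docker's `-v` option (for example, to `/etc/dcgm-exporter/custom-counters.csv`, an assumed path) and pass that container path to `-f` in place of `dcp-metrics-included.csv`.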

## Additional Resources

[Getting Started — NVIDIA DCGM Documentation](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html): Official instructions from NVIDIA for installing and configuring DCGM.

[DCGM Exporter — NVIDIA Docs Hub](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html): NVIDIA’s detailed instructions on running and customizing DCGM Exporter, integrating GPU telemetry into Kubernetes, and setting up Prometheus.

[DCGM-Exporter GitHub Repository](https://github.com/NVIDIA/dcgm-exporter): The DCGM-Exporter repository README has quickstart information for standalone and Kubernetes configurations, building from source, customizing output, and more.