Monitoring Metrics
Validated on 3 Nov 2025 • Last edited on 10 Nov 2025
DigitalOcean Monitoring is a free, opt-in service that lets you track Droplet resource usage in real time, visualize performance metrics, and receive alerts via email or Slack to proactively manage your infrastructure’s health.
DigitalOcean Monitoring tracks your Droplet’s resource usage over time, including CPU, memory, disk I/O, and GPU metrics for utilization, throttling, power, and ECC errors. This helps you understand performance trends and troubleshoot potential issues. You can view these metrics in real time or analyze historical data to identify patterns, spot bottlenecks, and improve overall reliability.
Monitoring works by installing the metrics agent on your Droplet. Once installed, the agent collects system metrics and sends them to the DigitalOcean Control Panel, where you can view the data in charts and set up alerts based on custom thresholds.
GPU Occupancy
GPU occupancy measures how busy the GPU’s compute units are over time, expressed as a percentage (%). It shows how efficiently your workload uses the GPU’s parallel processing cores.
A sustained GPU Occupancy (%) of 100% means your workload is fully using the GPU’s compute resources. Brief spikes are normal during burst activity and don’t necessarily indicate an issue.
For comparison, the utilization value from nvidia-smi measures total active time: the fraction of the sampling interval during which at least one kernel was executing on the GPU.
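As a quick check outside the control panel, you can read a comparable figure with nvidia-smi. The sketch below is a minimal Python example, assuming the NVIDIA driver and nvidia-smi are installed on the Droplet; the sample output values are hypothetical.

```python
import subprocess

def query_gpu_utilization() -> list[int]:
    """Return per-GPU occupancy (%) as reported by nvidia-smi.
    Requires the NVIDIA driver to be installed on the Droplet."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_utilization(out)

def parse_utilization(raw_csv: str) -> list[int]:
    """Parse one integer percentage per line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in raw_csv.strip().splitlines()]

# Parsing a hypothetical two-GPU output: one saturated, one mostly idle.
print(parse_utilization("97\n12\n"))  # [97, 12]
```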
Tensor Utilization
Tensor utilization measures how frequently the GPU’s tensor cores are active, expressed as a percentage (%). Tensor cores handle specialized matrix and AI/ML computations, making this metric especially useful for monitoring deep learning workloads.
Higher Tensor Utilization (%) means your workload is taking advantage of tensor core acceleration. If the value stays consistently low during ML or inference tasks, your model may not be optimized to use tensor operations efficiently.
GPU Memory Utilization
GPU memory utilization measures how much of the GPU’s total memory (VRAM) is currently in use, expressed as a percentage (%). It helps you track memory pressure and determine whether your workload fits within available GPU memory.
A high GPU Memory Utilization (%) indicates that most of the GPU’s memory is in use. If the value approaches 100%, the GPU may run out of memory, causing errors or reduced performance. Consider using smaller batch sizes or models if you see frequent out-of-memory errors.
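To reason about whether a workload will fit before memory utilization approaches 100%, a rough capacity check like the following can help. This is an illustrative Python sketch; the sizes and the 10% headroom are assumptions, not measured values.

```python
def fits_in_vram(model_mib: int, batch_mib_each: int, batch_size: int,
                 total_mib: int, headroom: float = 0.1) -> bool:
    """Rough check that model weights plus per-sample activation memory fit
    in VRAM, keeping `headroom` (10% by default) free to reduce the risk of
    out-of-memory errors. All sizes here are illustrative, not measured."""
    needed = model_mib + batch_mib_each * batch_size
    return needed <= total_mib * (1 - headroom)

# Hypothetical 80 GiB card (81920 MiB), 40 GiB of weights, 2 GiB per sample:
print(fits_in_vram(40960, 2048, 16, 81920))  # True: batch of 16 fits
print(fits_in_vram(40960, 2048, 24, 81920))  # False: shrink the batch
```

A check like this makes the advice above concrete: when the estimate comes back false, reduce the batch size or switch to a smaller model rather than waiting for out-of-memory errors.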
GPU Core Temperature
GPU core temperature measures the heat level of the GPU’s main processing die, expressed in degrees Celsius (°C). It helps you detect overheating that can lead to throttling or reduced performance.
A stable core temperature within your GPU’s rated range indicates normal operation. Consistently high temperatures may lead to performance throttling, which can decrease workload efficiency.
GPU Memory Temperature
GPU memory temperature measures the heat of the GPU’s onboard memory modules (VRAM), expressed in degrees Celsius (°C). Monitoring memory temperature helps detect potential heat buildup that could affect memory reliability and overall performance.
Sustained high memory temperatures can cause throttling or reduced throughput in memory-intensive workloads, decreasing overall performance efficiency.
GPU Power Usage
GPU power usage measures the GPU’s current power draw in watts (W). It helps you understand how much electrical power your workload consumes and whether the GPU is operating near its designed power limits.
Power usage fluctuates based on workload intensity. A steady increase in power draw typically indicates sustained GPU activity, while sudden drops or spikes may suggest throttling, idle periods, or transitions between GPU tasks.
Monitoring this metric alongside temperature and throttling data helps identify whether performance limits are due to power or thermal constraints.
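A simple way to see how close the GPU runs to its power envelope is to compare the current draw against the configured limit; nvidia-smi exposes these as the power.draw and power.limit query fields. The sketch below uses hypothetical numbers.

```python
def power_headroom(draw_w: float, limit_w: float) -> float:
    """Fraction of the configured power limit currently consumed.
    Inputs correspond to the nvidia-smi fields power.draw and
    power.limit, both in watts; the values below are hypothetical."""
    return draw_w / limit_w

# A card drawing 665 W against a 700 W limit is likely to hit
# power throttling under any further load.
ratio = power_headroom(665.0, 700.0)
print(f"{ratio:.0%} of power limit")  # 95% of power limit
```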
PCIe Throughput
PCIe throughput measures the data transfer rate between the GPU and the host system over the PCI Express (PCIe) bus, expressed in gigabits per second (Gbps). It helps you understand how much data moves between the GPU and CPU during workload execution.
Higher PCIe throughput typically indicates frequent data movement between the GPU and CPU, such as when loading models, transferring tensors, or exchanging intermediate results. If throughput remains consistently high, consider optimizing data transfer or increasing batch sizes to reduce PCIe overhead.
NVLink / XGMI Throughput
NVLink (for NVIDIA GPUs) and XGMI (for AMD GPUs) throughput measure peer-to-peer data transfer between multiple GPUs, expressed in gigabits per second (Gbps). This metric shows how efficiently GPUs exchange data in multi-GPU configurations.
Consistently high NVLink or XGMI throughput indicates strong communication between GPUs, such as during distributed training or large-scale parallel computations. If these values drop unexpectedly, check for workload imbalance or driver configuration issues that could limit multi-GPU performance.
Power Throttling
Power throttling measures the amount of time the GPU spends reducing performance to stay within its power limits, expressed as a percentage (%). This metric helps you identify when workloads push the GPU beyond its configured power envelope.
Short bursts of power throttling are normal under heavy load, but sustained or frequent power limit violations may indicate that the GPU is operating near its maximum design power. If this metric increases over time, review your workload’s power draw and consider reducing GPU utilization or enabling power management options.
Thermal Throttling
Thermal throttling measures the amount of time the GPU reduces performance to prevent overheating, expressed as a percentage (%). This metric helps detect temperature-related throttling that may affect sustained GPU performance.
Brief thermal throttling may occur during intensive workloads, but consistent or long-duration throttling indicates that the GPU is overheating. Review your workload intensity if thermal throttling durations rise over time.
GPU ECC Error Count
GPU ECC error count measures the number of double-bit memory errors detected by the GPU’s error-correcting code (ECC) system. It helps you identify potential memory instability or data integrity issues that may affect workload reliability.
A small number of ECC errors can occur naturally over time, especially under sustained or memory-intensive workloads. However, frequent or recurring ECC errors may indicate hardware degradation, excessive heat, or unstable power delivery.
If the count continues to increase, consider monitoring temperature trends, checking power stability, or replacing the GPU to prevent data corruption.
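A lightweight way to act on this is to poll the aggregate ECC counter periodically and flag any growth between samples. A minimal sketch, assuming the counter is polled via nvidia-smi's ECC query fields (the exact field name may vary by driver version):

```python
def ecc_errors_increasing(samples: list[int]) -> bool:
    """True if the aggregate ECC error count grew between any two
    consecutive polls. Counts might be collected with a query such as
    nvidia-smi --query-gpu=ecc.errors.uncorrected.aggregate.total
    (verify the field name on your driver version)."""
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))

print(ecc_errors_increasing([0, 0, 0, 0]))  # False: stable, no action needed
print(ecc_errors_increasing([0, 0, 2, 5]))  # True: errors accumulating
```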
Load Average
Load average shows how many processes are using or waiting to use the CPU over time, but it does not account for how many vCPUs a Droplet has. For example, a load average of two on a single-vCPU Droplet means one process is using the CPU while another waits, which shows the Droplet is over capacity. On a Droplet with four vCPUs, the same load means the system is using only half of its available CPU power.
The metrics agent reports 1-, 5-, and 15-minute load averages using data from /proc/loadavg, which tracks the number of processes that are either running or waiting for CPU time during those intervals.
We recommend comparing the load average to the number of vCPUs to understand how efficiently your Droplet uses its CPU resources. If the load average is higher than the vCPU count for an extended period, the Droplet may not have enough resources for the workload. To monitor this, use the 15-minute load average metric to see if your Droplet consistently exceeds the number of vCPUs. Brief spikes in the 1- and 5-minute averages usually reflect short-term activity and aren’t a concern.
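The comparison above can be expressed directly in code. This Python sketch uses os.getloadavg(), which reads the same data as /proc/loadavg:

```python
import os

def over_capacity(load15: float, vcpus: int) -> bool:
    """True if the 15-minute load average exceeds the vCPU count,
    i.e. processes are waiting for CPU time on average."""
    return load15 / vcpus > 1.0

def current_load_per_vcpu() -> float:
    """15-minute load average per vCPU on this machine; sustained
    values above 1.0 suggest the Droplet is over capacity."""
    _, _, load15 = os.getloadavg()  # same data as /proc/loadavg
    return load15 / os.cpu_count()

# The example from above: a load average of 2 on 1 vCPU vs. 4 vCPUs.
print(over_capacity(2.0, 1))  # True: one process runs, another waits
print(over_capacity(2.0, 4))  # False: half the CPU capacity is in use
```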
Memory Utilization
Memory utilization shows the percentage of physical memory in use. We calculate this using data from /proc/meminfo, which provides detailed information about total, free, and cached memory available on the system.
We calculate used memory by subtracting both free and cached memory from the total. This is because Linux uses unused memory for disk caching to improve performance, but frees that memory when it’s needed by processes. Since cached memory is effectively available, we don’t count it as used. As a result, tools like htop and top may show higher usage because they include cached memory in their used memory calculations.
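The calculation described above can be sketched as follows, using a small hypothetical /proc/meminfo sample for illustration (a real file contains many more fields):

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        info[key] = int(rest.split()[0])
    return info

def used_percent(info: dict[str, int]) -> float:
    """Used = total - free - cached, as described above. (Some tools
    also subtract Buffers; this sketch follows the text exactly.)"""
    used = info["MemTotal"] - info["MemFree"] - info["Cached"]
    return 100.0 * used / info["MemTotal"]

sample = "MemTotal: 4000000 kB\nMemFree: 1000000 kB\nCached: 1000000 kB"
print(f"{used_percent(parse_meminfo(sample)):.0f}%")  # 50%
```

Because the 1,000,000 kB of cached memory is treated as available, this reports 50% rather than the 75% a tool that counts cache as used would show.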
CPU Utilization
CPU utilization measures the percentage of total processing power the Droplet is using. Alert policies track total CPU usage without distinguishing between the system time (kernel-level instructions) and user time (everything outside the kernel).
We represent total usage across all CPUs as 100%. However, some tools report 100% per CPU core, so a system with two cores may show 200%, and a system with four cores may show 400%.
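To convert a per-core reading into the single 0–100% figure used here, average the per-core percentages. A small illustrative sketch:

```python
def normalize_total(per_core_percents: list[float]) -> float:
    """Convert per-core readings (each 0-100%) into a single 0-100%
    figure by averaging across all cores."""
    return sum(per_core_percents) / len(per_core_percents)

# A 2-core system that some tools would report as 200% (both cores busy):
print(normalize_total([100.0, 100.0]))  # 100.0
# One core saturated, one idle: such tools may show 100%; here it is 50%.
print(normalize_total([100.0, 0.0]))    # 50.0
```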
Disk Utilization
Disk utilization shows the percentage of total disk space in use. This includes the Droplet’s root storage and any attached block storage devices. We combine all storage into a single value that reflects total usage.
Alert policies also use this combined total when monitoring disk space.
Disk I/O
Disk I/O measures how much data the Droplet reads from and writes to its disks. High disk I/O can signal performance issues caused by intensive read or write operations.
You can set alert policies to monitor read and write activity separately and get notified when either exceeds your thresholds, helping you detect spikes or bottlenecks.
Bandwidth
Bandwidth measures the amount of incoming and outgoing network traffic on a Droplet. High usage may indicate heavy network activity or unusual traffic patterns.
These graphs split bandwidth into public and private traffic. Public bandwidth measures traffic sent and received through the Droplet’s public network interface, while private bandwidth tracks traffic between your Droplets in the same datacenter. Each graph includes separate lines for inbound and outbound traffic.