For AI agents: The documentation index is at https://docs.digitalocean.com/llms.txt. Markdown versions of pages use the same URL with index.html.md in place of the HTML page (for example, append index.html.md to the directory path instead of opening the HTML document).
Inference observability provides real-time and historical metrics across latency, throughput, error rates, token consumption, cost attribution, and rate limiting. Using these metrics, you get visibility into the performance, cost, and reliability of every inference request. We provide the following metrics that map directly to how serverless inference workloads behave and how they are billed:
## Reliability and Throughput

| Metric | Description |
| --- | --- |
| Error Rates | Error rate with a 4xx vs. 5xx split, so you can distinguish client-side issues (bad requests, auth failures) from server-side problems (model errors, capacity issues). |
| Success Rates | Percentage of requests returning 2xx responses, giving you a single health signal for your inference workload. |
| Requests Per Second (RPS) | Real-time request throughput, useful for understanding traffic patterns and capacity planning. |
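As a rough illustration of how these reliability numbers relate, here is a minimal sketch that derives the success rate, the 4xx/5xx error split, and RPS from a batch of raw HTTP status codes. The function name and record shape are hypothetical, not part of any DigitalOcean API:

```python
from collections import Counter

def summarize_reliability(statuses, window_seconds):
    """Compute success rate, 4xx/5xx error split, and RPS from a
    list of HTTP status codes observed over a time window."""
    counts = Counter()
    for status in statuses:
        if 200 <= status < 300:
            counts["2xx"] += 1
        elif 400 <= status < 500:
            counts["4xx"] += 1  # client-side: bad requests, auth failures
        elif 500 <= status < 600:
            counts["5xx"] += 1  # server-side: model errors, capacity issues
    total = len(statuses)
    return {
        "success_rate": counts["2xx"] / total if total else 0.0,
        "client_error_rate": counts["4xx"] / total if total else 0.0,
        "server_error_rate": counts["5xx"] / total if total else 0.0,
        "rps": total / window_seconds,
    }

stats = summarize_reliability([200, 200, 429, 500, 200], window_seconds=5)
# 3 of 5 requests succeeded; one 4xx, one 5xx; 5 requests in 5 seconds.
```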
## Latency

| Metric | Description |
| --- | --- |
| Time to First Token (TTFT) | Time from request submission to receiving the first output token. Captures queue wait time and prefill latency, the metric that most directly affects perceived responsiveness in streaming applications. |
| End-to-End Latency | Total time from sending the request to receiving the complete response. Covers queue wait, prefill, and full token generation, giving you the complete picture of request duration. |
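The distinction between the two latency metrics is easiest to see in code. The sketch below times any iterator that yields output tokens; the `fake_stream` generator is a stand-in for a real streaming inference response, with `sleep` calls simulating queue/prefill and decode time:

```python
import time

def measure_streaming_latency(stream):
    """Measure time to first token (TTFT) and end-to-end latency
    for any iterator that yields output tokens."""
    start = time.perf_counter()
    ttft = None
    for _token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # queue wait + prefill
    total = time.perf_counter() - start  # adds full token generation
    return ttft, total

def fake_stream():
    # Hypothetical stand-in for a streaming inference response.
    time.sleep(0.05)   # simulated queue + prefill delay
    yield "Hello"
    time.sleep(0.02)   # simulated decode time
    yield " world"

ttft, total = measure_streaming_latency(fake_stream())
```

End-to-end latency always includes TTFT, so `total >= ttft` holds for every request; a large gap between the two indicates generation time dominates, while a large TTFT points at queueing or prefill.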
## Cost and Usage

| Metric | Description |
| --- | --- |
| Cost Attribution | Per-invocation cost tracking tied to specific models. Because serverless pricing is pay-per-use, this lets you see exactly which models and workloads drive your spend. |
| Per-Request Cost Breakdown | Cost attributed to each individual request, including the model used and input/output token counts, so you can audit spend at the most granular level. |
| Total Token Usage | Aggregate token consumption across all models, giving you a high-level view of overall platform utilization. |
| Token Usage per Model / Model Type | Token consumption broken down by individual model or model category (text, vision-language, image, audio, video), so you can identify which models consume the most resources. |
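Because pay-per-use billing is driven by token counts, a per-request cost breakdown reduces to multiplying input and output tokens by the model's rates. The model name and prices below are placeholders for illustration only; real rates come from the model's pricing page:

```python
# Hypothetical per-million-token prices (USD); not real DigitalOcean rates.
PRICES = {
    "example-text-model": {"input": 0.20, "output": 0.60},
}

def request_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Attribute cost to a single request from its token counts."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = request_cost("example-text-model", input_tokens=1_500, output_tokens=500)
# 1,500 input tokens at $0.20/M plus 500 output tokens at $0.60/M = $0.0006
```

Summing this per-request figure by model gives the cost-attribution view; summing the raw token counts instead gives the total and per-model token-usage views.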
## Multimodal Metrics

| Metric | Description |
| --- | --- |
| Image Count | Number of images generated over time. Tracks image generation volume for models like Stable Diffusion 3.5 Large. |
| Audio Duration (Seconds) | Total length of generated audio output in seconds. Relevant for text-to-speech models like Qwen 3 TTS, where billing and resource usage correlate with output duration. |
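Multimodal metrics aggregate the same way as token metrics, just over different units. A minimal sketch of rolling per-request usage records up into per-model image counts and audio seconds; the record field names here are hypothetical:

```python
from collections import defaultdict

def aggregate_multimodal(usage_records):
    """Roll up image counts and audio seconds per model from
    per-request usage records (field names are hypothetical)."""
    totals = defaultdict(lambda: {"images": 0, "audio_seconds": 0.0})
    for rec in usage_records:
        t = totals[rec["model"]]
        t["images"] += rec.get("images", 0)
        t["audio_seconds"] += rec.get("audio_seconds", 0.0)
    return dict(totals)

totals = aggregate_multimodal([
    {"model": "stable-diffusion-3.5-large", "images": 2},
    {"model": "qwen-3-tts", "audio_seconds": 12.5},
    {"model": "qwen-3-tts", "audio_seconds": 7.5},
])
# One model accumulates 2 images; the other 20.0 seconds of audio.
```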