Serverless Inference Metrics

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare their capabilities and pricing. From the same control plane, you can route inference requests to the best-fit model and run inference using serverless or dedicated deployments.

Inference observability provides real-time and historical metrics across latency, throughput, error rates, token consumption, cost attribution, and rate limiting. These metrics give you visibility into the performance, cost, and reliability of every inference request, and they map directly to how serverless inference workloads behave and how they are billed:

Reliability and Throughput

Error Rates: Error rate with a 4xx vs 5xx split, so you can distinguish client-side issues (bad requests, auth failures) from server-side problems (model errors, capacity issues).
Success Rates: Percentage of requests returning 2xx responses, giving you a single health signal for your inference workload.
Requests Per Second (RPS): Real-time request throughput, useful for understanding traffic patterns and capacity planning.
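
To make the relationship between these three signals concrete, here is a minimal Python sketch that aggregates a window of request logs into a success rate, a 4xx/5xx error split, and RPS. The log shape (a `status` field per request) is an assumption for illustration, not a platform API:

```python
from collections import Counter

def summarize_reliability(requests, window_seconds):
    """Aggregate a window of request records into reliability metrics.

    `requests` is assumed to be an iterable of dicts with an integer
    `status` field (the HTTP status code of each inference request).
    """
    counts = Counter()
    for r in requests:
        status = r["status"]
        if 200 <= status < 300:
            counts["2xx"] += 1
        elif 400 <= status < 500:
            counts["4xx"] += 1  # client-side: bad requests, auth failures
        elif status >= 500:
            counts["5xx"] += 1  # server-side: model errors, capacity issues
    total = sum(counts.values())
    return {
        "success_rate": counts["2xx"] / total if total else 0.0,
        "error_rate_4xx": counts["4xx"] / total if total else 0.0,
        "error_rate_5xx": counts["5xx"] / total if total else 0.0,
        "rps": total / window_seconds,
    }

# Example: a 60-second window of request logs
window = [{"status": 200}, {"status": 200}, {"status": 429}, {"status": 503}]
print(summarize_reliability(window, window_seconds=60))
```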

Latency

Time to First Token (TTFT): Time from request submission to receiving the first output token. Captures queue wait time and prefill latency; this is the metric that most directly affects perceived responsiveness in streaming applications.
End-to-End Latency: Total time from sending the request to receiving the complete response. Covers queue wait, prefill, and full token generation, giving you the complete picture of request duration.
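
The distinction between the two latency metrics is easiest to see in code. Below is a minimal sketch, assuming `stream` is any iterator that yields output chunks as the model produces them (a hypothetical stand-in for a streaming client, not a platform SDK call):

```python
import time

def measure_latency(stream):
    """Measure TTFT and end-to-end latency for one streaming response."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            # First chunk received: queue wait and prefill are behind us.
            ttft = time.perf_counter() - start
    # Stream exhausted: full token generation is complete.
    e2e = time.perf_counter() - start
    return ttft, e2e
```

Collecting both values per request lets you separate a long queue or slow prefill (high TTFT) from slow token generation (high end-to-end latency with a normal TTFT).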

Cost and Usage

Cost Attribution: Per-invocation cost tracking tied to specific models. Because serverless pricing is pay-per-use, this lets you see exactly which models and workloads drive your spend.
Per-Request Cost Breakdown: Cost attributed to each individual request, including the model used and input/output token counts, so you can audit spend at the most granular level.
Total Token Usage: Aggregate token consumption across all models, giving you a high-level view of overall platform utilization.
Token Usage per Model / Model Type: Token consumption broken down by individual model or model category (text, vision-language, image, audio, video), so you can identify which models consume the most resources.
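
As a sketch of how a per-request cost breakdown rolls up into model-level cost attribution, the snippet below prices each request from its input/output token counts. The model names and per-million-token rates are invented placeholders; actual rates come from the Model Catalog:

```python
# Invented placeholder rates (USD per million tokens), for illustration only.
PRICES = {
    "model-a": {"input": 0.50, "output": 1.50},
    "model-b": {"input": 0.10, "output": 0.30},
}

def request_cost(model, input_tokens, output_tokens):
    """Per-request cost breakdown: token counts in each direction,
    multiplied by that model's per-million-token rate."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def attribute_costs(requests):
    """Cost attribution: roll per-request costs up to the model level."""
    totals = {}
    for r in requests:
        cost = request_cost(r["model"], r["input_tokens"], r["output_tokens"])
        totals[r["model"]] = totals.get(r["model"], 0.0) + cost
    return totals

# Example: two logged requests, attributed by model
log = [
    {"model": "model-a", "input_tokens": 1200, "output_tokens": 350},
    {"model": "model-b", "input_tokens": 400, "output_tokens": 80},
]
print(attribute_costs(log))
```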

Multimodal Metrics

Image Count: Number of images generated over time. Tracks image generation volume for models like Stable Diffusion 3.5 Large.
Audio Duration (Seconds): Total length of generated audio output in seconds. Relevant for text-to-speech models like Qwen 3 TTS, where billing and resource usage correlate with output duration.
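
Because each modality is measured in its own unit (a count for images, seconds for audio), aggregation is per-unit rather than per-token. A minimal sketch, assuming a hypothetical log with `kind` and `duration_s` fields:

```python
def multimodal_usage(records):
    """Aggregate multimodal output volume in each modality's billing unit:
    images as a count, audio as total seconds of generated output."""
    images = sum(1 for r in records if r["kind"] == "image")
    audio_seconds = sum(r["duration_s"] for r in records if r["kind"] == "audio")
    return {"image_count": images, "audio_duration_s": audio_seconds}
```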
