Serverless Inference Metrics

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare their capabilities and pricing. From the same control plane, you can route inference requests to the best-fit model and run inference using serverless or dedicated deployments.

Inference observability provides real-time and historical metrics across latency, throughput, error rates, token consumption, cost attribution, and rate limiting. These metrics give you visibility into the performance, cost, and reliability of every inference request, and they map directly to how serverless inference workloads behave and how they are billed:

Reliability and Throughput

Error Rates: Error rate with a 4xx vs 5xx split, so you can distinguish client-side issues (bad requests, auth failures) from server-side problems (model errors, capacity issues).
Success Rates: Percentage of requests returning 2xx responses, giving you a single health signal for your inference workload.
Requests Per Second (RPS): Real-time request throughput, useful for understanding traffic patterns and capacity planning.
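
To make the relationship between these three signals concrete, here is a minimal Python sketch that aggregates a window of request logs into a success rate, a 4xx/5xx error split, and RPS. The log shape (a `status` field per request) is an assumption for illustration, not a platform API:

```python
from collections import Counter

def summarize_reliability(requests, window_seconds):
    """Aggregate a window of request records into reliability metrics.

    `requests` is assumed to be an iterable of dicts with an integer
    `status` field (the HTTP status code of each inference request).
    """
    counts = Counter()
    for r in requests:
        status = r["status"]
        if 200 <= status < 300:
            counts["2xx"] += 1
        elif 400 <= status < 500:
            counts["4xx"] += 1  # client-side: bad requests, auth failures
        elif status >= 500:
            counts["5xx"] += 1  # server-side: model errors, capacity issues
    total = sum(counts.values())
    return {
        "success_rate": counts["2xx"] / total if total else 0.0,
        "error_rate_4xx": counts["4xx"] / total if total else 0.0,
        "error_rate_5xx": counts["5xx"] / total if total else 0.0,
        "rps": total / window_seconds,
    }

# Example: a 60-second window of request logs
window = [{"status": 200}, {"status": 200}, {"status": 429}, {"status": 503}]
print(summarize_reliability(window, window_seconds=60))
```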

Latency

Time to First Token (TTFT): Time from request submission to receiving the first output token. Captures queue wait time and prefill latency; this is the metric that most directly affects perceived responsiveness in streaming applications.
End-to-End Latency: Total time from sending the request to receiving the complete response. Covers queue wait, prefill, and full token generation, giving you the complete picture of request duration.
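
The distinction between the two latency metrics is easiest to see in code. Below is a minimal sketch, assuming `stream` is any iterator that yields output chunks as the model produces them (a hypothetical stand-in for a streaming client, not a platform SDK call):

```python
import time

def measure_latency(stream):
    """Measure TTFT and end-to-end latency for one streaming response."""
    start = time.perf_counter()
    ttft = None
    for _ in stream:
        if ttft is None:
            # First chunk received: queue wait and prefill are behind us.
            ttft = time.perf_counter() - start
    # Stream exhausted: full token generation is complete.
    e2e = time.perf_counter() - start
    return ttft, e2e
```

Collecting both values per request lets you separate a long queue or slow prefill (high TTFT) from slow token generation (high end-to-end latency with a normal TTFT).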

Cost and Usage

Cost Attribution: Per-invocation cost tracking tied to specific models. Because serverless pricing is pay-per-use, this lets you see exactly which models and workloads drive your spend.
Per-Request Cost Breakdown: Cost attributed to each individual request, including the model used and input/output token counts, so you can audit spend at the most granular level.
Total Token Usage: Aggregate token consumption across all models, giving you a high-level view of overall platform utilization.
Token Usage per Model / Model Type: Token consumption broken down by individual model or model category (text, vision-language, image, audio, video), so you can identify which models consume the most resources.
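
As a sketch of how a per-request cost breakdown rolls up into model-level cost attribution, the snippet below prices each request from its input/output token counts. The model names and per-million-token rates are invented placeholders; actual rates come from the Model Catalog:

```python
# Invented placeholder rates (USD per million tokens), for illustration only.
PRICES = {
    "model-a": {"input": 0.50, "output": 1.50},
    "model-b": {"input": 0.10, "output": 0.30},
}

def request_cost(model, input_tokens, output_tokens):
    """Per-request cost breakdown: token counts in each direction,
    multiplied by that model's per-million-token rate."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def attribute_costs(requests):
    """Cost attribution: roll per-request costs up to the model level."""
    totals = {}
    for r in requests:
        cost = request_cost(r["model"], r["input_tokens"], r["output_tokens"])
        totals[r["model"]] = totals.get(r["model"], 0.0) + cost
    return totals

# Example: two logged requests, attributed by model
log = [
    {"model": "model-a", "input_tokens": 1200, "output_tokens": 350},
    {"model": "model-b", "input_tokens": 400, "output_tokens": 80},
]
print(attribute_costs(log))
```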

Multimodal Metrics

Image Count: Number of images generated over time. Tracks image generation volume for models like Stable Diffusion 3.5 Large.
Audio Duration (Seconds): Total length of generated audio output in seconds. Relevant for text-to-speech models like Qwen 3 TTS, where billing and resource usage correlate with output duration.
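
Because each modality is measured in its own unit (a count for images, seconds for audio), aggregation is per-unit rather than per-token. A minimal sketch, assuming a hypothetical log with `kind` and `duration_s` fields:

```python
def multimodal_usage(records):
    """Aggregate multimodal output volume in each modality's billing unit:
    images as a count, audio as total seconds of generated output."""
    images = sum(1 for r in records if r["kind"] == "image")
    audio_seconds = sum(r["duration_s"] for r in records if r["kind"] == "audio")
    return {"image_count": images, "audio_duration_s": audio_seconds}
```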
