Inference Features

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can browse available foundation models, both DigitalOcean-hosted and third-party commercial, and compare their capabilities and pricing. You can also use routing to match inference requests to the best-fit model and run inference using serverless or dedicated deployments.

Models

Models are large language models (LLMs) trained on large datasets to perform a variety of tasks.

  • Model Catalog: Browse available foundation models, including commercial and open-source options. You can view model capabilities and pricing, filter models by provider, classification, context window, and other attributes, open models in the Model Playground for testing, run inference using serverless deployments, or deploy models with dedicated infrastructure for production workloads.

  • The Model Catalog is also available through an MCP server. For setup information, see Model Catalog MCP Tools.

  • Model Playground: Test and compare model performance in a web-based interface. You can adjust settings like temperature and token limits, evaluate model responses, and fine-tune how your agents behave.

Serverless Inference

Send API requests directly to foundation models without creating or managing an agent. Serverless inference runs requests immediately using your model access key and model ID with no need to define instructions or context ahead of time. You can scope model access keys to specific foundation models and inference routers, enable batch inference, and restrict them to a VPC network.
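
For example, a single chat completion request can be sent with any OpenAI-compatible client. The following is a minimal sketch; the base URL, model ID, and environment variable name are placeholders, so substitute the values shown with your model access key in the control panel.

    import os
    from openai import OpenAI

    # Point an OpenAI-compatible client at the serverless inference endpoint.
    # The base URL and model ID below are placeholders; use the values shown
    # with your model access key in the control panel.
    client = OpenAI(
        base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
        api_key=os.environ["MODEL_ACCESS_KEY"],      # your model access key
    )

    response = client.chat.completions.create(
        model="llama3.3-70b-instruct",               # placeholder model ID
        messages=[{"role": "user", "content": "Summarize VPC peering in two sentences."}],
        max_tokens=200,
    )
    print(response.choices[0].message.content)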

Prompt Caching

Use prompt caching with the serverless inference chat completions and responses APIs to cache context and reuse it in future requests. If part of your request is already cached, you are charged a lower price for those cached tokens and the standard price for the remaining input tokens, which significantly reduces the cost of inference.

For a list of models that support prompt caching, see Foundation Models. For examples, see Use Prompt Caching in the serverless inference documentation.
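
As a rough sketch, caching benefits requests that share a long, stable prefix. The example below keeps a large system prompt identical across calls so that repeated requests can be served from cache; the exact caching behavior and any cache-related response fields depend on the model provider, and the endpoint and model ID shown are placeholders.

    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
        api_key=os.environ["MODEL_ACCESS_KEY"],
    )

    # A long, stable prefix (policies, schemas, reference text) that every request
    # repeats verbatim. Keeping it byte-identical is what lets the cached-token
    # discount apply; caching typically requires the prefix to exceed a minimum
    # length, which varies by model.
    SHARED_CONTEXT = "Full support playbook text goes here ..."  # stands in for a long document

    def answer(question: str) -> str:
        response = client.chat.completions.create(
            model="anthropic-claude-sonnet",          # placeholder model ID
            messages=[
                {"role": "system", "content": SHARED_CONTEXT},  # cacheable prefix
                {"role": "user", "content": question},          # varying suffix
            ],
        )
        return response.choices[0].message.content

    print(answer("How do I rotate an access key?"))
    print(answer("What is the refund policy?"))  # prefix tokens may be billed at the cached rate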

Tool (Function) Calling

Tool (function) calling enables foundation models to interact with external tools to access real-time data, perform actions, and extend their capabilities beyond their internal knowledge. The model identifies that a user request requires external information or action and decides which tool to use. The model doesn’t execute the tool itself; instead, it returns a structured request containing the parameters the tool needs. The application runs the tool and feeds the results back to the model, which uses this information to produce a complete response.
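
The round trip looks roughly like the sketch below, using the OpenAI-style tools parameter. The endpoint, model ID, and the get_weather tool are illustrative assumptions, not a definitive reference.

    import json, os
    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
                    api_key=os.environ["MODEL_ACCESS_KEY"])

    # 1. Describe the external tool so the model can decide when to call it.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Do I need an umbrella in Amsterdam today?"}]
    first = client.chat.completions.create(model="openai-gpt-4o",   # placeholder model ID
                                           messages=messages, tools=tools)

    # 2. The model returns structured arguments instead of executing anything itself.
    call = first.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)

    # 3. Your application runs the tool and feeds the result back to the model.
    weather = {"city": args["city"], "condition": "rain", "temp_c": 12}   # stand-in for a real API call
    messages.append(first.choices[0].message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(weather)})

    final = client.chat.completions.create(model="openai-gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)   # uses the tool result in its answer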

All commercial models from Anthropic and OpenAI available on DigitalOcean support tool (function) calling.

Observability

DigitalOcean gives you full visibility into the performance, cost, and reliability of every inference request running through the DigitalOcean AI Platform. We provide metrics that map directly to how serverless inference workloads behave and how they are billed.

Inference observability surfaces real-time and historical metrics across latency, throughput, error rates, token consumption, cost attribution, and rate limiting. Using these metrics, you can:

  • Detect error spikes, latency degradation, and rate-limit events.
  • Trace usage back to a specific model, modality, or individual request.
  • View per-request cost breakdowns and model-level cost attribution to identify optimization opportunities.
  • Compare token efficiency, latency, and cost across models to make informed optimization decisions.
  • Identify when rate limits or token limits are constraining throughput so you can adjust request patterns or request quota increases.

Dedicated Inference

Dedicated Inference is a managed inference service that lets you host and scale open-source and commercial LLMs on dedicated GPUs and deploy your model as an inference endpoint. You can choose the GPU type for each endpoint and scale it up or down by setting the desired node count, including scaling GPU replicas down to zero to eliminate idle GPU time.

Dedicated Inference is a Kubernetes-native product where we manage ingress networking, RDMA for multi-node model serving, model storage, model serving engine (vLLM), single- and multi-node disaggregation, software components for autoscaling, prefix-aware routing, and parallelism.

Dedicated Inference is available in public preview and enabled for all users. You can contact support for questions or assistance.

Batch Inference

Note
Only text prompts for OpenAI and Anthropic commercial models are supported for batch inference.

Batch Inference lets you run large collections of LLM requests as a single asynchronous job and retrieve results when processing completes, typically within 24 hours. Using batch inference significantly reduces cost compared to real-time inference. Use it for high-volume, non-interactive workloads where you do not need immediate responses, such as:

  • Large-scale evaluation suites (MMLU, SimpleQA, SWE-bench)
  • Synthetic data generation
  • Document classification and content moderation at scale
  • Data enrichment pipelines
  • Offline summarization and transformation

If you are already using OpenAI or Anthropic models on DigitalOcean, you can switch to batch processing by changing a single endpoint without needing to make any schema changes.
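
For illustration, each line of the batch input is a self-contained request whose body matches the synchronous chat completions schema, so existing payloads can be reused unchanged. The file layout and job-creation calls below follow the OpenAI-style batch format; the endpoint, field names, and model ID are assumptions rather than a definitive reference.

    import json, os
    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
                    api_key=os.environ["MODEL_ACCESS_KEY"])

    # Each JSONL line wraps an ordinary chat completion body, so no schema changes
    # are needed when moving a synchronous workload to batch processing.
    with open("requests.jsonl", "w") as f:
        for i, doc in enumerate(["First document ...", "Second document ..."]):
            f.write(json.dumps({
                "custom_id": f"doc-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "openai-gpt-4o",   # placeholder model ID
                    "messages": [{"role": "user", "content": f"Classify this document: {doc}"}],
                },
            }) + "\n")

    # Upload the file and create the asynchronous job; results are typically
    # available within 24 hours via the job's output file.
    batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
    job = client.batches.create(input_file_id=batch_file.id,
                                endpoint="/v1/chat/completions",
                                completion_window="24h")
    print(job.id, job.status)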

Batch inference is resilient to partial failures at every stage:

  • Transient errors (such as 429, 408, and 5xx) on individual requests are retried automatically up to 2 times with exponential backoff before being written to the error file.

  • Individual request failures do not fail the entire batch job. A single malformed prompt or content policy rejection is written to the error file without disrupting processing of other requests.

  • If a batch job expires or fails partway through, you can create a continuation job that processes only the requests that failed or were not reached, without reprocessing already completed work.

  • Job creation is idempotent. If your client application retries the request due to a network error, the platform returns the original batch object instead of creating a duplicate one.

  • If a job expires (exceeds the 24-hour window) or is cancelled, all requests that completed before the terminal state are saved to the output file and billed. No completed work is lost.

Guardrails for Content Safety

Every inference request flows through a two-stage content safety pipeline.

  • Input guardrails: When a request is initiated, input guardrails are applied using prompt-level triage, which flags or blocks requests before inference begins.

  • Output guardrails: After the inference completes, generated content (such as images, audio, and video) is evaluated using output guardrails.

Finally, our security process makes a policy decision on whether the content is safe or violates our platform content policy. Safe content is made available to the user, while content in violation of the platform content policy is withheld from the user and preserved for review. The API returns a standardized error response with error.type and reason_codes fields explaining the block. Attempts to bypass guardrails, whether successful or not, are logged. All generated content is traceable to the specific API key and user that initiated the request, with a precise timestamp.
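
A client can branch on that standardized error rather than treating a policy block as a generic failure. The sketch below assumes the block surfaces as an API error whose JSON body carries the error.type and reason_codes fields described above; the exact status code, type value, and payload shape are assumptions.

    import os
    from openai import OpenAI, APIStatusError

    client = OpenAI(base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
                    api_key=os.environ["MODEL_ACCESS_KEY"])

    user_prompt = "Describe how to configure a firewall rule."

    try:
        response = client.chat.completions.create(
            model="openai-gpt-4o",                               # placeholder model ID
            messages=[{"role": "user", "content": user_prompt}],
        )
        print(response.choices[0].message.content)
    except APIStatusError as err:
        # Assumed shape: the guardrail block returns a structured error body
        # with error.type and reason_codes explaining the decision.
        body = err.response.json()
        error = body.get("error", {})
        if error.get("type") == "content_policy_violation":      # assumed type value
            print("Blocked by guardrails:", error.get("reason_codes"))
        else:
            raise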

Generated content is held in a private quarantine area until it passes the guardrails. Only content that passes review is promoted to a release state and made available to you.

Claude Code and Other Agentic Workflow Support

The Messages API serves as the foundational engine for Claude Code for direct filesystem interaction and terminal execution. You can leverage Claude Code’s agentic capabilities within DigitalOcean using the Anthropic models. This allows developers to use the Claude Code interface to orchestrate tasks across models like Anthropic Sonnet 4.6 for rapid iteration and Opus 4.6 for complex reasoning or specialized refactoring, all through a single, unified endpoint. This integration enables you to maintain a stateful, agentic workflow while optimizing for cost and performance across different model providers.
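
Because the integration is exposed through the Messages API, any Anthropic-compatible client can be pointed at the unified endpoint. The snippet below is a minimal sketch using the anthropic Python SDK; the base URL and model ID are placeholders, and Claude Code itself can typically be redirected the same way through its base-URL and auth-token environment variables.

    import os
    from anthropic import Anthropic

    # Point the Messages API client at the unified inference endpoint.
    # Base URL and model ID are placeholders; use your model access key.
    client = Anthropic(
        base_url="https://inference.do-ai.run",     # placeholder endpoint
        api_key=os.environ["MODEL_ACCESS_KEY"],
    )

    message = client.messages.create(
        model="anthropic-claude-sonnet-4.6",        # placeholder model ID
        max_tokens=1024,
        messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
    )
    print(message.content[0].text)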

Model Evaluations

Model evaluations is in public preview. You can opt in from the Feature Preview page.

Model evaluation enables structured assessment of models to determine which model best fits your specific use case. Using this feature, you can assess model quality against your proprietary datasets using an LLM-as-a-Judge framework. You can measure how well a model performs against quality criteria, such as correctness, and safety metrics, such as PII leakage. You can run evaluations using customizable test cases and system prompts that define what you want to measure and how. Model evaluations can provide useful feedback to help you tweak and improve your model’s responses.
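
Conceptually, the LLM-as-a-Judge pattern scores a candidate model's answer with a second model against criteria you define. The following is an illustrative sketch of that loop, not the Model Evaluations API itself; the endpoint, model IDs, dataset, and rubric are assumptions.

    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://inference.do-ai.run/v1",   # placeholder endpoint
                    api_key=os.environ["MODEL_ACCESS_KEY"])

    test_cases = [  # your proprietary dataset: prompt plus a reference answer
        {"prompt": "What port does HTTPS use?", "reference": "443"},
    ]

    JUDGE_RUBRIC = (
        "Score the candidate answer from 1-5 for correctness against the reference. "
        "Deduct points for PII leakage or unsafe content. Reply with only the number."
    )

    for case in test_cases:
        # 1. The candidate model produces an answer.
        answer = client.chat.completions.create(
            model="llama3.3-70b-instruct",                        # candidate model (placeholder ID)
            messages=[{"role": "user", "content": case["prompt"]}],
        ).choices[0].message.content

        # 2. A judge model grades it against the rubric and the reference answer.
        score = client.chat.completions.create(
            model="openai-gpt-4o",                                # judge model (placeholder ID)
            messages=[
                {"role": "system", "content": JUDGE_RUBRIC},
                {"role": "user", "content": f"Reference: {case['reference']}\nCandidate: {answer}"},
            ],
        ).choices[0].message.content
        print(case["prompt"], "->", score)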
