Inference Features
Validated on 5 May 2026 • Last edited on 8 May 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.
Models
Models are large language models (LLMs) trained on large datasets to perform a variety of tasks.
-
Model Catalog: Browse available foundation models, including commercial and open-source options. You can view model capabilities and pricing, filter models by provider, classification, context window, and other attributes, open models in the Model Playground for testing, run inference using serverless deployments, or deploy models with dedicated infrastructure for production workloads.
Model Catalog is also available through an MCP server. For setup information, see Model Catalog MCP Tools.
-
Bring Your Own Models (BYOM): Import your own models into Model Catalog from Hugging Face or DigitalOcean Spaces, with support for gated Hugging Face models. BYOM imports support only Safetensors files and the following accompanying file types:
.json,.yaml,.yml,.jinja,.model,.txt,.png,.jpg,.jpeg,.md,LICENSE,NOTICE,.gitattributes, and.gitignore. Only dedicated inference-compatible architectures are supported, includingQwen2ForCausalLMandQwen3ForCausalLM. -
Model Playground: Test and compare model performance in a web-based interface. You can adjust settings like temperature and token limits, evaluate model responses, and fine-tune how your agents behave.
Serverless Inference
Send API requests directly to foundation models without creating or managing an agent. Serverless inference runs requests immediately using your model access key and model ID with no need to define instructions or context ahead of time. You can scope model access keys to specific foundation models and inference routers, enable batch inference, and restrict them to a VPC network.
Prompt Caching
Use prompt caching with the serverless inference chat completion and responses APIs to cache context and use it in future requests. If part of your request is already cached, you are charged a lower price for those cached tokens, and the standard price for the remaining input tokens. This significantly reduces the cost for inference.
For a list of models that support prompt caching, see Foundation Models. For examples, see Use Prompt Caching in the serverless inference documentation.
Tool (Function) Calling
Tool (function) calling enables foundation models to interact with external tools to access real-time data, perform actions, and extend their capabilities beyond their internal knowledge. The model identifies that a user request requires external information or action, and decides which tool to use. The model doesn’t execute the tool itself. Instead, it provides the necessary parameters to the external tool using a structured request to provide the required information or perform the action. The application runs the tool, and feeds the results back to the model. The model uses this information to produce a complete response.
All commercial models from Anthropic and OpenAI available on DigitalOcean support tool (function) calling.
Multimodal Inference
Multimodal models process and generate content across multiple data types, including images, audio, video, and text, thus enabling a much broader range of real-world applications, including document intelligence, voice agents, visual question answering, image generation, speech transcription, and video generation. We support the following types of multimodal models:
| Model Type | Description | Input | Output | Supported Models |
|---|---|---|---|---|
| Vision-Language Models (VLMs) | Use them for visual question answering, image summarization, document understanding, and multimodal reasoning. | Text, image, and video inputs | Text outputs | Nemotron-Nano 12B v2-VL, Kimi K2.5 Large, Kimi K2.6 |
| Text-to-Speech (TTS) Models | Synthesize natural-sounding speech from text input. | Text | Audio (in WAV or MP3 format) | Qwen3 TTS |
| Image Generation Models | Text-to-image models that create high-quality images from text prompts, and multimodal models that support both image generation and editing from text and image inputs. | Text prompt and optional image | Image | Stable Diffusion 3.5 Large, GPT Image 1.5 |
| Text-to-Video (T2V) Generation Models | Generate short video clips entirely from text prompts describing scenes, subjects, motion, and cinematic style. | Text prompt | MP4 video | Wan 2.2 T2V A14B |
Generated content is never stored long-term on behalf of users. For synchronous requests (image and audio generation), the output is returned directly in the API response. For asynchronous video generation requests, result storage is temporary and expires 2 hours after the job completes.
Observability
DigitalOcean gives you full visibility into the performance, cost, and reliability of every inference request running through DigitalOcean Inference. We provide metrics that map directly to how serverless inference workloads behave and how they are billed.
Inference observability surfaces real-time and historical metrics across latency, throughput, error rates, token consumption, cost attribution, and rate limiting. Using these metrics, you can:
- Detect error spikes, latency degradation, and rate-limit
- Trace usage back to a specific model, modality, or individual request.
- View per-request cost breakdown and model-level cost attribution to identify optimization opportunities.
- Compare token efficiency, latency, and cost across models to make informed optimization decisions
- Identify when rate limits or token limits are constraining throughput so you can adjust request patterns or request quota increases
Dedicated Inference
Dedicated Inference is a managed inference service that lets you host and scale open-source and commercial LLMs on dedicated GPUs and deploy your model as an inference endpoint. You can choose the GPU type per endpoint and scale them up or down by setting the desired node count, including scaling down GPU replicas to zero to prevent idle GPU time.
Dedicated Inference is a Kubernetes-native product where we manage ingress networking, RDMA for multi-node model serving, model storage, model serving engine (vLLM), single- and multi-node disaggregation, software components for autoscaling, prefix-aware routing, and parallelism.
Dedicated Inference is available in public preview and enabled for all users. You can contact support for questions or assistance.
Batch Inference
Batch Inference lets you run large collections of LLM requests as a single asynchronous job and retrieve results when processing completes, typically within 24 hours. Using batch inferencing significantly reduces cost compared to real-time inference. Use batch inference for high-volume, non-interactive workloads where you do not need immediate responses, such as for:
- Large-scale evaluation suites (MMLU, SimpleQA, SWE-bench)
- Synthetic data generation
- Document classification and content moderation at scale
- Data enrichment pipelines
- Offline summarization and transformation
If you are already using OpenAI or Anthropic models on DigitalOcean, you can switch to batch processing by changing a single endpoint without needing to make any schema changes.
Batch inference is resilient to partial failures at every stage:
-
Transient errors (such as
429,408, and5xx) on individual requests are retried automatically up to 2 times with exponential backoff before being written to the error file. -
Individual request failures do not fail the entire batch job. A single malformed prompt or content policy rejection is written to the error file without disrupting processing of other requests.
-
If a batch job expires or fails partway through, you can create a continuation job that processes only the requests that failed or were not reached, without reprocessing already completed work.
-
Job creation is idempotent. If your client application retries the request due to a network error, the platform returns the original batch object instead of creating a duplicate one.
-
If a job expires (exceeds the 24-hour window) or is cancelled, all requests that completed before the terminal state are saved to the output file and billed. No completed work is lost.
Inference Router public
Inference Router is available in public preview and enabled for all users. You can contact support for questions or assistance.
Most AI applications use one model for everything, which drives up prices and latency for unneeded tasks. Using the Inference Router automatically matches every request to the best-fit model based on the task at hand: coding, creative writing, summarization, and more.
Inference routing enables you to configure rules that route inference requests to foundation models, based on defined policies and tasks instead of selecting a single model. Use a DigitalOcean preset router to get started quickly, or define your own routing policy in natural language and set your priority in cost or latency. The router analyzes each prompt to infer the task (based on the configuration), then applies your preferences to select a model. When a model hits a rate limit or capacity constraint, requests fall back automatically with no dropped calls. Every routing decision is transparent and is traced with the model selected and task detected, allowing you to always know what ran and why.
Model Evaluations public
Model evaluations is in public preview. You can opt in from the Feature Preview page.
Model evaluation enables structured assessment of models for determining which model best fits your specific use case. Using this feature, you can assess model quality against your proprietary datasets using an LLM-as-a-Judge framework. You can measure how well your model performs across a variety of criteria, such as correctness, and safety metrics such as PII leakage. You can run evaluations using customizable test cases and system prompts that define what you want to measure and how. Model evaluations can provide useful feedback to help you tweak and improve your model’s responses.
Guardrails for Content Safety
Every inference request flows through a two-stage content safety pipeline.
-
Input guardrails: When a request is initiated, input guardrails are applied using prompt-level triage, which flags or blocks requests before inference begins.
-
Output guardrails: After the inference completes, generated content (such as images, audio, video) is evaluated using output guardrails.
Finally, our security process makes a policy decision on whether the content is safe or violates our platform content policy. Safe content is made available to the user, while content in violation of the platform content policy is withheld from the user and preserved for review. The API returns a standardized error response with error.type and reason_codes fields explaining the block. Attempts to bypass guardrails, whether successful or not, are logged. All generated content is traceable to the specific API key and user that initiated the request, with a precise timestamp.
Generated content is held in a private quarantine area until it passes the guardrails. Then, only content that passes review is promoted to a release state and can be accessed by you.
Claude Code and Other Agentic Workflow Support
The Messages API serves as the foundational engine for Claude Code for direct filesystem interaction and terminal execution. You can leverage Claude Code’s agentic capabilities within DigitalOcean using the Anthropic models. This allows developers to use the Claude Code interface to orchestrate tasks across diverse models like Anthropic Sonnet 4.6 for rapid iterations, Opus 4.6 for complex reasoning, or for specialized refactoring, all through a single, unified endpoint. This integration enables you to maintain a stateful, agentic workflow while optimizing for cost and performance across different model providers.
Agents
Agents are AI-powered tools that can perform a wide range of tasks, like answering questions or generating text content. Agents can use a combination of foundation models, knowledge bases, functions, and guardrails to inform their responses to user queries.
You can interact with agents in the following ways:
-
Agent endpoints: Each agent has an endpoint that allows you to interact with it through an API. You can integrate endpoints into your applications, customize requests to the agent, and authenticate them using access keys.
-
Chatbot embed: We provide a code snippet for each agent that allow you to embed a chatbot interface into your website or application.
-
Agent playground: We provide a web-based interface for interacting with agents, allowing you to test and refine agents.
-
Agent tracing: View a step-by-step timeline of how your agent processes prompts, including token usage, processing time, and resource access. Each trace also includes the full input and output for every interaction, giving you a complete record of the conversation flow.
- Insights: Analyze trace data to generate recommendations for improving efficiency and accuracy. Insights send trace data to a third-party model for processing, and you receive data-driven suggestions to reduce latency, optimize token usage, and improve agent behavior. These recommendations help you troubleshoot issues, enhance performance, and lower costs.
-
Agent feedback: End users and agent developers can provide feedback on the quality and helpfulness of agent responses. The feedback is collected through the chatbot interface, agent playground, and log stream traces.
-
Agent templates: We provide templates for common use cases, such as for customer support and business analysis. Templates have predefined instructions and foundation models that allow you to quickly create an agent.
Agent Guardrails
Guardrails scan an agent’s input and output for sensitive and inappropriate content and override the agent’s output when it detects the specified problematic content. For example, they help prevent an agent from sharing login credentials or credit card information when tuned correctly for your specific use case.
We offer the following guardrails that you can attach to your agent:
-
Sensitive Data: Identifies and anonymizes various categories of sensitive information, including credit card numbers, personally identifiable information, and location data.
-
Jailbreak: Helps your agent maintain proper functionality by preventing malicious inputs.
-
Content Moderation: Controls agent output by filtering responses related to inappropriate content categories, including violence and hate, sexual content, weapons, regulated substances, self-harm, and illegal activities.
Agent Evaluations
Agent evaluations are automated tests that can provide insight into how well your agents are responding to prompts you’ve provided. Workspaces let you run evaluations on multiple agents at once.
There are 19 evaluation metrics available you can use to evaluate your agents, including checking for factual correctness, instruction adherence, tone, and toxicity.
The test results are percentage pass/fail scores with visualizations so you can see your agents’ performance over time.
Agent and Function Routing
You can use agent and function routing to create more complex and dynamic responses to user queries.
-
Agent Routing directs queries to the right agent based on context.
-
Function Routing enhances agent responses with real-time or external data.
For example, you may have one agent to answer general travel questions and another to manage booking. Agent routing automatically sends booking-related requests to the booking agent for a more accurate response. Function routing can then call a function to retrieve weather information which the booking agent can include in its reply to provide more relevant travel recommendations.
Agent Development Kit (ADK) public
The DigitalOcean AI Agent Development Kit (ADK) is a Python SDK and CLI that lets you deploy agent code as a hosted, production-ready service. The ADK is in public preview. You can opt in from the Feature Preview page.
The ADK has the following key features:
-
Quick Setup: Build agents quickly with existing models that DigitalOcean provides, including commercial models from OpenAI and Anthropic, or third-party models. You can use any model key, even if the model isn’t hosted on DigitalOcean or the model key is not provided by DigitalOcean.
-
Python package
gradient-adk: The ADK includes a Python packagegradient-adkthat you can install usingpipto integrate with your existing code.
-
Framework Agnostic: Use LangGraph, LangChain, CrewAI, or custom code for your agent code.
-
Local Testing: Test on your machine before deploying. You can have multiple versions of your agent deployed to different environments in the same workspace simultaneously. This lets you test agent configurations and workflows exhaustively before you deploy the workflow to production.
-
One-Command Deploy: Use the
gradient agent deploycommand to deploy your agent and make it live. -
Automatic Monitoring: Use built-in traces and logs for monitoring agent performance. Capture calls to LangGraph nodes automatically to create traces to debug and evaluate agent performance when you use LangGraph to add nodes.
-
Streaming Support: View real-time responses.
-
Evaluation Framework: Run comprehensive evaluations with custom metrics.
-
Knowledge Base Integration: Connect to Knowledge Bases for RAG.
See Build Agents Using the Agent Development Kit to learn more about using the ADK to create agents.
DigitalOcean Knowledge Bases
A knowledge base is a private repository of unstructured content, such as files, folders, and URLs, that improves agent responses using retrieval-augmented generation (RAG). Knowledge bases store source data in DigitalOcean Spaces object storage and store indexes in a DigitalOcean OpenSearch cluster.
You can add data sources to knowledge bases from Spaces buckets, local files, seed or site map URLs, Dropbox folders, and Amazon S3 buckets.
Embeddings Models
The embeddings models convert unstructured data into vector embeddings so AI agents can find content that matches a user’s input.
Activity Logs
Activity logs give you visibility into indexing jobs for each knowledge base. You can view recent activity and download CSVs for debugging.
Retrieve
The retrieve feature lets you query a knowledge base for relevant chunks, apply metadata filters, and review the results for use in your applications and agent workflows. Each user query is vectorized using the knowledge base’s selected embeddings model.
You can retrieve data and run semantic, keyword, or hybrid searches, review scored chunks, and generate live Gradient SDK (Python) and cURL examples from the current query via the Control Panel or API.
You can optionally enable reranking to re-score and reorder retrieved chunks so the most relevant results appear first.
Knowledge base retrieval is also available through an MCP server for querying, filtering, and retrieving chunks. For setup information, see Knowledge Bases MCP Tools.
RAG Playground
RAG Playground lets you test how a selected serverless inference model answers a query using content retrieved from a knowledge base. You can enter a query, choose a model, and adjust settings such as system instructions, max tokens, and temperature.
RAG Playground shows the generated answer alongside retrieved chunks, including source details, page numbers, relevance scores, and which chunks were used in the response.
Auto-Indexing
Auto-indexing keeps data sources up to date by re-indexing changes on a recurring schedule.
Chunking
Chunking controls how documents are split before indexing. You configure chunking per data source and use different strategies in the same knowledge base.
DigitalOcean AI Platform supports section-based, semantic, hierarchical, and fixed length chunking for different document types, retrieval patterns, and cost needs.
For details and recommendations, see our chunking best practices and chunking parameters reference.