Serverless Inference Overview
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare their capabilities and pricing. You can also use routing to match inference requests to the best-fit model and run inference using serverless or dedicated deployments.
Serverless inference lets you send API requests directly to foundation models without creating an AI agent or managing infrastructure. Authenticate requests to the serverless inference API with a model access key or a DigitalOcean personal access token. Model access keys are recommended because you can scope them to specific foundation models, enable batch inference, and restrict them to a VPC network.
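As a rough illustration, the sketch below sends a chat completion request to the serverless inference API, passing a model access key as a bearer token. The endpoint URL, model name, and MODEL_ACCESS_KEY environment variable are placeholder assumptions; substitute the values shown for your account.

```python
import os
import requests

# Placeholder endpoint and model name; use the values for your account.
INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"
MODEL = "llama3.3-70b-instruct"

# A model access key (recommended) or a personal access token works here.
ACCESS_KEY = os.environ["MODEL_ACCESS_KEY"]

response = requests.post(
    INFERENCE_URL,
    headers={"Authorization": f"Bearer {ACCESS_KEY}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize what serverless inference is."}],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```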
Serverless inference automatically scales to handle incoming requests and supports generating text, images, audio, and other model outputs. Because serverless inference does not maintain sessions, each request must include the full context needed by the model.
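Because no session is stored, a multi-turn conversation is built client-side by resending the entire message history with every call. The sketch below follows the same assumed OpenAI-style chat format and placeholder endpoint and model name as the previous example.

```python
import os
import requests

INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"}

def chat(messages):
    # Every call sends the full message list; the service keeps no session state.
    resp = requests.post(
        INFERENCE_URL,
        headers=HEADERS,
        json={"model": "llama3.3-70b-instruct", "messages": messages},  # placeholder model
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]

history = [{"role": "user", "content": "Name three DigitalOcean products."}]
history.append(chat(history))  # the first reply becomes part of the context
history.append({"role": "user", "content": "Which of those launched first?"})
print(chat(history)["content"])  # the follow-up resends the whole history
```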
All requests are billed per input and output token.
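The token counts a request is billed for are typically reported back in the response body. The helper below assumes the OpenAI-style usage object and field names used in the sketches above.

```python
def billed_tokens(completion):
    # `completion` is a parsed chat completions response, as in the sketches above.
    # The usage object (assumed OpenAI-style field names) carries the billed counts.
    usage = completion.get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)

# Example: input_tokens, output_tokens = billed_tokens(response.json())
```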
When to Use Serverless Inference Versus Dedicated Inference
Dedicated inference is a managed inference service that enables you to host and scale open-source and commercial LLMs on dedicated GPUs. It gives you more control over the environment so you can choose the GPU, tune performance, and optimize your models for throughput, latency, cost, or concurrency. Dedicated inference is best suited for steady, high-throughput workloads.
Serverless inference lets you send API requests directly to foundation models. Choose serverless inference over dedicated inference when you need to get started quickly without managing any components behind an inference endpoint, don’t have a custom model to host or optimize, or have unpredictable or spiky inference traffic.
Pricing for serverless inference is based on the number of tokens used, while pricing for dedicated inference is based on the GPU hours used.
If you want to use dedicated inference, see Use Dedicated Inference.