Serverless Inference Overview
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare their capabilities and pricing. You can also use routing to match inference requests to the best-fit model and run inference using serverless or dedicated deployments.
Serverless inference lets you send API requests directly to foundation models without creating an AI agent or managing infrastructure. Authenticate requests to the serverless inference API with a model access key or a DigitalOcean personal access token. Model access keys are recommended because you can scope them to specific foundation models, enable batch inference, and restrict them to a VPC network.
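As a rough illustration, the sketch below sends a chat completion request to the serverless inference API, passing a model access key as a bearer token. The endpoint URL, model name, and MODEL_ACCESS_KEY environment variable are placeholder assumptions; substitute the values shown for your account.

```python
import os
import requests

# Placeholder endpoint and model name; use the values for your account.
INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"
MODEL = "llama3.3-70b-instruct"

# A model access key (recommended) or a personal access token works here.
ACCESS_KEY = os.environ["MODEL_ACCESS_KEY"]

response = requests.post(
    INFERENCE_URL,
    headers={"Authorization": f"Bearer {ACCESS_KEY}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize what serverless inference is."}],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```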
Serverless inference automatically scales to handle incoming requests and supports generating text, images, audio, and other model outputs. Because serverless inference does not maintain sessions, each request must include the full context needed by the model.
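Because no session is stored, a multi-turn conversation is built client-side by resending the entire message history with every call. The sketch below follows the same assumed OpenAI-style chat format and placeholder endpoint and model name as the previous example.

```python
import os
import requests

INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"}

def chat(messages):
    # Every call sends the full message list; the service keeps no session state.
    resp = requests.post(
        INFERENCE_URL,
        headers=HEADERS,
        json={"model": "llama3.3-70b-instruct", "messages": messages},  # placeholder model
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]

history = [{"role": "user", "content": "Name three DigitalOcean products."}]
history.append(chat(history))  # the first reply becomes part of the context
history.append({"role": "user", "content": "Which of those launched first?"})
print(chat(history)["content"])  # the follow-up resends the whole history
```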
All requests are billed per input and output token.
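The token counts a request is billed for are typically reported back in the response body. The helper below assumes the OpenAI-style usage object and field names used in the sketches above.

```python
def billed_tokens(completion):
    # `completion` is a parsed chat completions response, as in the sketches above.
    # The usage object (assumed OpenAI-style field names) carries the billed counts.
    usage = completion.get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)

# Example: input_tokens, output_tokens = billed_tokens(response.json())
```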
When to Use Serverless Inference Versus Dedicated Inference
Dedicated inference is a managed inference service that enables you to host and scale open-source and commercial LLMs on dedicated GPUs. It gives you more control over the environment so you can choose the GPU, tune performance, and optimize your models for throughput, latency, cost, or concurrency. Dedicated inference is best suited for steady, high-throughput workloads.
Serverless inference lets you send API requests directly to foundation models. Choose serverless inference over dedicated inference when you need to get started quickly without managing any components behind an inference endpoint, don’t have a custom model to host or optimize, or have unpredictable or spiky inference traffic.
Pricing for serverless inference is based on the number of tokens used, while pricing for dedicated inference is based on the GPU hours used.
If you want to use dedicated inference, see Use Dedicated Inference.