Serverless Inference API Endpoints
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models; compare model capabilities and pricing; use routing to match inference requests to the best-fit model; and run inference using serverless or dedicated deployments.
To use serverless inference, authenticate your HTTP requests with a model access key or a DigitalOcean personal access token, then send prompts to models for chat completions and for image, audio, and text-to-speech generation.
Prerequisites
Create a model access key in the DigitalOcean Control Panel. Once you have a credential, send your prompts to models from OpenAI, Anthropic, Meta, or other providers using the serverless inference API endpoints.
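As a minimal sketch of what an authenticated request looks like, the snippet below builds (but does not send) a request to the chat completions endpoint using only the standard library. The access key value and `example-model` ID are placeholders; substitute your own credential and a model ID returned by `GET /v1/models`.

```python
import json
import urllib.request

# Placeholder credential; use your real model access key or personal access token.
ACCESS_KEY = "YOUR_MODEL_ACCESS_KEY"

def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {ACCESS_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "https://inference.do-ai.run/v1/chat/completions",
    {"model": "example-model", "messages": [{"role": "user", "content": "Hello"}]},
)
# To actually send the request: urllib.request.urlopen(req)
```

The same bearer-token header works for every endpoint listed below; only the URL and request body change.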
Serverless Inference API Endpoints
Depending on the modality, the serverless inference endpoints can be synchronous or asynchronous:
- Synchronous: The output is returned directly in the API response (as base64 for binary content) and is not stored long-term on behalf of users. These endpoints are OpenAI-compatible. Use synchronous endpoints for image generation and audio workloads.
- Asynchronous: You submit a job, receive a job ID, and poll for the result. When the job completes, you fetch the generated output from the result endpoint. Use asynchronous endpoints for video generation and other long-running workloads.
Warning For asynchronous video generation requests, result storage is temporary and expires 2 hours after the job completes. After this window, the generated video and any presigned download URLs are permanently purged and cannot be retrieved.
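The submit-then-poll flow for asynchronous endpoints can be sketched as follows. Here `get_job` is a hypothetical stand-in for an HTTP call that returns the job's status; the `status` field names are illustrative, not taken from the API reference.

```python
import time

def poll_until_done(get_job, job_id, interval=5.0, timeout=600.0):
    """Poll a job-status function until the job completes or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_job(job_id)  # e.g. a GET on the job's status endpoint
        if job["status"] == "completed":
            # Contains the result; download promptly, since video results
            # and their presigned URLs expire 2 hours after completion.
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Because video generation can take from 30 seconds to several minutes, a polling interval of a few seconds with a generous timeout is a reasonable starting point.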
The following table shows the available serverless inference endpoints:
| API Name | Type | Base URL | Endpoint | Verb | Description |
|---|---|---|---|---|---|
| Model | Synchronous | `https://inference.do-ai.run` | `/v1/models` | GET | Returns a list of available models and their IDs. |
| Chat Completions | Synchronous | `https://inference.do-ai.run` | `/v1/chat/completions` | POST | Sends chat-style prompts and returns model responses. |
| Responses | Synchronous | `https://inference.do-ai.run` | `/v1/responses` | POST | Sends chat-style prompts and returns text or multimodal model responses. |
| Images | Synchronous | `https://inference.do-ai.run` | `/v1/images/generations` | POST | Generates images from text prompts. Supports output resolutions up to 1 megapixel (1024×1024). |
| VLM/Document Parsing | Synchronous | `https://api.gradient.ai` | `/v1/chat/completions` | POST | Generates text from text and image inputs. |
| Image Generation | Synchronous | `https://api.gradient.ai` | `/v1/images/generations` | POST | Generates images from text inputs. |
| Text-to-Speech | Synchronous | `https://api.gradient.ai` | `/v1/audio/speech` | POST | Converts text to natural-sounding speech (binary audio data). The `Content-Type` header matches the requested format (such as `audio/mpeg` for mp3). Streaming uses chunked transfer encoding. |
| Video | Asynchronous | `https://api.gradient.ai` | `/v1/video/generations` | POST | Generates short video clips from text prompts. Generation can take from 30 seconds to several minutes. Output is MP4 video at either 480p (9 seconds) or 720p (5 seconds); duration is fixed per resolution tier. |
| fal Models | Asynchronous | `https://inference.do-ai.run` | `/v1/async-invoke` | POST | Sends text, image, or text-to-speech generation requests to [fal models](/products/ai-platform/details/models#foundation-models). |
| Embeddings | Synchronous | `https://inference.do-ai.run` | `/v1/embeddings` | POST | Converts text into dense vector representations for semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching. |
| Messages | Synchronous | `https://inference.do-ai.run` | `/v1/messages` | POST | Serves as the interface for Claude Code and other agentic workflows that need direct filesystem interaction and terminal execution through DigitalOcean. Setting `ANTHROPIC_BASE_URL` to the DigitalOcean inference endpoint lets you use Anthropic-compatible tooling without vendor lock-in. |
For more information, see the API reference.
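Because synchronous image responses return base64 data rather than stored files, your client needs to decode the response body itself. The sketch below assumes the OpenAI-compatible images response shape (a `data` array of objects with a `b64_json` field); check the API reference for the exact fields.

```python
import base64
import json

def decode_images(response_body: bytes) -> list[bytes]:
    """Extract raw image bytes from an OpenAI-compatible images response."""
    payload = json.loads(response_body)
    return [base64.b64decode(item["b64_json"]) for item in payload["data"]]

# Each element can then be written to disk, e.g. as image-0.png, image-1.png, ...
```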
We support both Chat Completions and Responses APIs for sending prompts. Choose the endpoint that best fits your use case:
- Use the Chat Completions API when building or maintaining chat-style integrations that rely on structured `messages` with roles such as `system`, `user`, and `assistant`, or when migrating existing chat-based code with minimal changes.
- Use the Responses API when building new integrations or working with newer models that only support the Responses API. It is also useful for multi-step tool use in a single request, preserving state across turns with `store: true`, and simplifying requests by using a single `input` field with improved caching efficiency.
- Use the Embeddings API to convert text into dense vector representations for use in semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching.
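The difference between the three request shapes can be sketched as follows. The field names follow standard OpenAI API conventions, and `example-model` is a placeholder; list real model IDs with `GET /v1/models`.

```python
import json

model = "example-model"  # placeholder; use an ID from GET /v1/models

# Chat Completions: structured messages with roles.
chat_payload = {
    "model": model,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize serverless inference."},
    ],
}

# Responses: a single input field; store=True preserves state across turns.
responses_payload = {
    "model": model,
    "input": "Summarize serverless inference.",
    "store": True,
}

# Embeddings: text in, dense vectors out.
embeddings_payload = {
    "model": model,
    "input": ["serverless inference", "dedicated deployments"],
}

# All three serialize to the JSON request body.
bodies = [json.dumps(p) for p in (chat_payload, responses_payload, embeddings_payload)]
```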
You can use these endpoints through cURL, the OpenAI Python SDK, the Gradient Python SDK, or PyDo.
Alternatively, you can call serverless inference from your automation workflows. The n8n community node connects to any DigitalOcean-hosted model using your model access key. You can self-host n8n using the n8n Marketplace app.