Serverless Inference API Endpoints
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models; compare model capabilities and pricing; use routing to match inference requests to the best-fit model; and run inference using serverless or dedicated deployments.
To use serverless inference, authenticate your HTTP requests with a model access key or a DigitalOcean personal access token, then send prompts to models for chat completions and for image, audio, and text-to-speech generation.
Prerequisites
Create a model access key in the DigitalOcean Control Panel. Once you have a credential, send your prompts to models from OpenAI, Anthropic, Meta, or other providers using the serverless inference API endpoints.
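As a minimal sketch of what an authenticated request looks like, the snippet below builds (but does not send) a request to the chat completions endpoint using only the standard library. The access key value and `example-model` ID are placeholders; substitute your own credential and a model ID returned by `GET /v1/models`.

```python
import json
import urllib.request

# Placeholder credential; use your real model access key or personal access token.
ACCESS_KEY = "YOUR_MODEL_ACCESS_KEY"

def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {ACCESS_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request(
    "https://inference.do-ai.run/v1/chat/completions",
    {"model": "example-model", "messages": [{"role": "user", "content": "Hello"}]},
)
# To actually send the request: urllib.request.urlopen(req)
```

The same bearer-token header works for every endpoint listed below; only the URL and request body change.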
Serverless Inference API Endpoints
Depending on the modality, the serverless inference endpoints can be synchronous or asynchronous:
- Synchronous: The output is returned directly in the API response (as base64 for binary content) and is not stored long-term on behalf of users. These endpoints are OpenAI-compatible. Use synchronous endpoints for image generation and audio workloads.
- Asynchronous: You submit a job, receive a job ID, and poll for the result. When the job completes, you fetch the generated output from the result endpoint. Use asynchronous endpoints for video generation and other long-running workloads.
Warning For asynchronous video generation requests, result storage is temporary and expires 2 hours after the job completes. After this window, the generated video and any presigned download URLs are permanently purged and cannot be retrieved.
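The submit-then-poll flow for asynchronous endpoints can be sketched as follows. Here `get_job` is a hypothetical stand-in for an HTTP call that returns the job's status; the `status` field names are illustrative, not taken from the API reference.

```python
import time

def poll_until_done(get_job, job_id, interval=5.0, timeout=600.0):
    """Poll a job-status function until the job completes or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_job(job_id)  # e.g. a GET on the job's status endpoint
        if job["status"] == "completed":
            # Contains the result; download promptly, since video results
            # and their presigned URLs expire 2 hours after completion.
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

Because video generation can take from 30 seconds to several minutes, a polling interval of a few seconds with a generous timeout is a reasonable starting point.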
The following table shows the available serverless inference endpoints:
| API Name | Type | Base URL | Endpoint | Verb | Description |
|---|---|---|---|---|---|
| Model | Synchronous | `https://inference.do-ai.run` | `/v1/models` | GET | Returns a list of available models and their IDs. |
| Chat Completions | Synchronous | `https://inference.do-ai.run` | `/v1/chat/completions` | POST | Sends chat-style prompts and returns model responses. |
| Responses | Synchronous | `https://inference.do-ai.run` | `/v1/responses` | POST | Sends chat-style prompts and returns text or multimodal model responses. |
| Images | Synchronous | `https://inference.do-ai.run` | `/v1/images/generations` | POST | Generates images from text prompts. Supports output resolutions up to 1 megapixel (1024×1024). |
| VLM/Document Parsing | Synchronous | `https://api.gradient.ai` | `/v1/chat/completions` | POST | Generates text from text and image inputs. |
| Image Generation | Synchronous | `https://api.gradient.ai` | `/v1/images/generations` | POST | Generates images from text inputs. |
| Text-to-Speech | Synchronous | `https://api.gradient.ai` | `/v1/audio/speech` | POST | Converts text to natural-sounding speech (binary audio data). The `Content-Type` header matches the requested format (such as `audio/mpeg` for mp3). Streaming uses chunked transfer encoding. |
| Video | Asynchronous | `https://api.gradient.ai` | `/v1/video/generations` | POST | Generates short video clips from text prompts. Generation can take from 30 seconds to several minutes. Output is MP4 video at either 480p (9 seconds) or 720p (5 seconds); duration is fixed per resolution tier. |
| fal Models | Asynchronous | `https://inference.do-ai.run` | `/v1/async-invoke` | POST | Sends text, image, or text-to-speech generation requests to [fal models](/products/ai-platform/details/models#foundation-models). |
| Embeddings | Synchronous | `https://inference.do-ai.run` | `/v1/embeddings` | POST | Converts text into dense vector representations for semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching. |
| Messages | Synchronous | `https://inference.do-ai.run` | `/v1/messages` | POST | Serves as the interface for Claude Code and other agentic workflows that need direct filesystem interaction and terminal execution through DigitalOcean. Setting `ANTHROPIC_BASE_URL` to the DigitalOcean inference endpoint lets you use Anthropic-compatible tooling without vendor lock-in. |
For more information, see the API reference.
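Because synchronous image responses return base64 data rather than stored files, your client needs to decode the response body itself. The sketch below assumes the OpenAI-compatible images response shape (a `data` array of objects with a `b64_json` field); check the API reference for the exact fields.

```python
import base64
import json

def decode_images(response_body: bytes) -> list[bytes]:
    """Extract raw image bytes from an OpenAI-compatible images response."""
    payload = json.loads(response_body)
    return [base64.b64decode(item["b64_json"]) for item in payload["data"]]

# Each element can then be written to disk, e.g. as image-0.png, image-1.png, ...
```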
We support both Chat Completions and Responses APIs for sending prompts. Choose the endpoint that best fits your use case:
- Use the Chat Completions API when building or maintaining chat-style integrations that rely on structured `messages` with roles such as `system`, `user`, and `assistant`, or when migrating existing chat-based code with minimal changes.
- Use the Responses API when building new integrations or working with newer models that only support the Responses API. It is also useful for multi-step tool use in a single request, preserving state across turns with `store: true`, and simplifying requests by using a single `input` field with improved caching efficiency.
- Use the Embeddings API to convert text into dense vector representations for use in semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching.
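The difference between the three request shapes can be sketched as follows. The field names follow standard OpenAI API conventions, and `example-model` is a placeholder; list real model IDs with `GET /v1/models`.

```python
import json

model = "example-model"  # placeholder; use an ID from GET /v1/models

# Chat Completions: structured messages with roles.
chat_payload = {
    "model": model,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize serverless inference."},
    ],
}

# Responses: a single input field; store=True preserves state across turns.
responses_payload = {
    "model": model,
    "input": "Summarize serverless inference.",
    "store": True,
}

# Embeddings: text in, dense vectors out.
embeddings_payload = {
    "model": model,
    "input": ["serverless inference", "dedicated deployments"],
}

# All three serialize to the JSON request body.
bodies = [json.dumps(p) for p in (chat_payload, responses_payload, embeddings_payload)]
```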
You can use these endpoints through cURL, the OpenAI Python SDK, the Gradient Python SDK, or PyDo.
Alternatively, you can call serverless inference from your automation workflows. The n8n community node connects to any DigitalOcean-hosted model using your model access key. You can self-host n8n using the n8n Marketplace app.