Serverless Inference API Endpoints

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can browse the available foundation models (both DigitalOcean-hosted and third-party commercial models), compare model capabilities and pricing, route inference requests to the best-fit model, and run inference on serverless or dedicated deployments.

To use serverless inference, authenticate your HTTP requests with a model access key or a DigitalOcean personal access token, then send your prompts to models for chat completions, image and video generation, text-to-speech, and embeddings.

Prerequisites

Create a model access key in the DigitalOcean Control Panel. Once you have a credential, send your prompts to models from OpenAI, Anthropic, Meta, or other providers using the serverless inference API endpoints.
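For example, a minimal authenticated chat completion request can be built with only the standard library, as in the sketch below. The model ID and the `DO_MODEL_ACCESS_KEY` environment variable name are illustrative assumptions; list real model IDs with `GET /v1/models`.

```python
import json
import os
import urllib.request

BASE_URL = "https://inference.do-ai.run/v1"

def build_chat_request(prompt, model, key):
    """Build an authenticated POST request for /v1/chat/completions.

    The key is a model access key or a DigitalOcean personal access token.
    """
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def chat(prompt, model="llama3.3-70b-instruct"):  # model ID is an example
    req = build_chat_request(prompt, model, os.environ["DO_MODEL_ACCESS_KEY"])
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same request works with the OpenAI Python SDK by pointing its `base_url` at the inference endpoint and passing your key as the API key.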

Serverless Inference API Endpoints

Depending on the modality, the serverless inference endpoints can be synchronous or asynchronous:

  • Synchronous: The response is returned in the same API call. Binary output is returned in the response body as base64 and is not stored long-term on behalf of users. These endpoints are OpenAI-compatible. Use for image generation and audio workloads.

  • Asynchronous: You submit a job, receive a job ID, and poll for the result. When the job completes, you fetch the complete generated result from the endpoint. Use for video generation and other long-running workloads.

    Warning
    For asynchronous video generation requests, result storage is temporary and expires 2 hours after the job completes. After this window, the generated video and any presigned download URLs are permanently purged and cannot be retrieved.
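In code, the submit-then-poll flow might look like the following sketch. The status path (`/v1/video/generations/{id}`), the `id` and `status` response fields, and the terminal state names are assumptions for illustration; check the API reference for the actual contract.

```python
import json
import os
import time
import urllib.request

BASE_URL = "https://api.gradient.ai/v1"
TERMINAL_STATES = {"completed", "failed"}  # assumed terminal job states

def is_terminal(job):
    """Return True once a polled job payload reports a terminal state."""
    return job.get("status") in TERMINAL_STATES

def _call(path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=data,
        headers={
            "Authorization": f"Bearer {os.environ['DO_MODEL_ACCESS_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST" if body is not None else "GET",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def generate_video(prompt, model, poll_seconds=10):
    # Submit the job; the response is assumed to include a job ID.
    job = _call("/video/generations", {"model": model, "prompt": prompt})
    # Poll until the job reaches a terminal state, then return the payload.
    # Download the MP4 promptly: results expire 2 hours after completion.
    while not is_terminal(job):
        time.sleep(poll_seconds)
        job = _call(f"/video/generations/{job['id']}")
    return job
```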

The following table shows the available serverless inference endpoints:

| API Name | Type | Base URL | Endpoint | Verb | Description |
|---|---|---|---|---|---|
| Models | Synchronous | https://inference.do-ai.run | /v1/models | GET | Returns a list of available models and their IDs. |
| Chat Completions | Synchronous | https://inference.do-ai.run | /v1/chat/completions | POST | Sends chat-style prompts and returns model responses. |
| Responses | Synchronous | https://inference.do-ai.run | /v1/responses | POST | Sends chat-style prompts and returns text or multimodal model responses. |
| Images | Synchronous | https://inference.do-ai.run | /v1/images/generations | POST | Generates images from text prompts. Supports output resolutions up to 1 megapixel (1024×1024). |
| VLM/Document Parsing | Synchronous | https://api.gradient.ai | /v1/chat/completions | POST | Generates text from text and image inputs. |
| Image Generation | Synchronous | https://api.gradient.ai | /v1/images/generations | POST | Generates images from text inputs. |
| Text-to-Speech | Synchronous | https://api.gradient.ai | /v1/audio/speech | POST | Converts text to natural-sounding speech (binary audio data). The Content-Type header matches the requested format (such as audio/mpeg for mp3). Streaming uses chunked transfer encoding. |
| Video | Asynchronous | https://api.gradient.ai | /v1/video/generations | POST | Generates short video clips from text prompts. Generation can take anywhere from 30 seconds to several minutes. Output format is MP4, either 480p (9 seconds) or 720p (5 seconds). Duration is fixed per resolution tier. |
| fal Models | Asynchronous | https://inference.do-ai.run | /v1/async-invoke | POST | Sends text, image, or text-to-speech generation requests to fal models. |
| Embeddings | Synchronous | https://inference.do-ai.run | /v1/embeddings | POST | Converts text into dense vector representations for use in semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching. |
| Messages | Synchronous | https://inference.do-ai.run | /v1/messages | POST | Serves as the interface for Claude Code and other agentic workflows for direct filesystem interaction and terminal execution through DigitalOcean. Setting ANTHROPIC_BASE_URL to the DigitalOcean inference endpoint lets you avoid vendor lock-in. |
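As a starting point, you could fetch the model list to discover valid model IDs before sending prompts. This sketch assumes the endpoint returns an OpenAI-style `{"data": [{"id": ...}, ...]}` payload and that the key lives in a `DO_MODEL_ACCESS_KEY` environment variable.

```python
import json
import os
import urllib.request

def extract_ids(payload):
    """Pull the id field from each entry of an OpenAI-style model list."""
    return [m["id"] for m in payload.get("data", [])]

def list_model_ids(api_key=None):
    """GET /v1/models and return the available model IDs."""
    key = api_key or os.environ["DO_MODEL_ACCESS_KEY"]
    req = urllib.request.Request(
        "https://inference.do-ai.run/v1/models",
        headers={"Authorization": f"Bearer {key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_ids(json.load(resp))
```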

For more information, see the API reference.

We support both Chat Completions and Responses APIs for sending prompts. Choose the endpoint that best fits your use case:

  • Use the Chat Completions API when building or maintaining chat-style integrations that rely on structured messages with roles such as system, user, and assistant, or when migrating existing chat-based code with minimal changes.

  • Use the Responses API when building new integrations or working with newer models that only support the Responses API. It’s also useful for multi-step tool use in a single request, preserving state across turns with store: true, and simplifying requests by using a single input field with improved caching efficiency.

  • Use the Embeddings API to convert text into dense vector representations for use in semantic search, retrieval-augmented generation (RAG), clustering, classification, and similarity matching.
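To illustrate similarity matching with the Embeddings API: the sketch below sends an OpenAI-style `model`/`input` request body and compares the returned vectors with cosine similarity. The model ID and the `{"data": [{"embedding": [...]}, ...]}` response shape are assumptions; check the API reference for the actual schema.

```python
import json
import math
import os
import urllib.request

def embed(texts, model="example-embedding-model"):  # model ID is illustrative
    """POST /v1/embeddings and return one vector per input text."""
    req = urllib.request.Request(
        "https://inference.do-ai.run/v1/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['DO_MODEL_ACCESS_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [item["embedding"] for item in json.load(resp)["data"]]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

With two embedded texts, `cosine_similarity(vecs[0], vecs[1])` scores how semantically close they are: near 1.0 for closely related text, near 0 for unrelated text.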

You can use these endpoints through cURL, the OpenAI Python SDK, the Gradient Python SDK, or PyDo.

Alternatively, you can call serverless inference from your automation workflows. The n8n community node connects to any DigitalOcean-hosted model using your model access key. You can self-host n8n using the n8n Marketplace app.
