How to Use Serverless Inference on DigitalOcean Gradient™ AI Platform
Validated on 9 Feb 2026 • Last edited on 13 Feb 2026
DigitalOcean Gradient™ AI Platform lets you build fully managed AI agents with knowledge bases for retrieval-augmented generation, multi-agent routing, guardrails, and more, or use serverless inference to make direct requests to popular foundation models.
Serverless inference lets you send API requests directly to foundation models without creating or managing an AI agent. The model generates responses without any initial instructions or configuration.
All requests are billed per input and output token.
Prerequisites
To use serverless inference, you need to authenticate your HTTP requests with a model access key. You can create a model access key in the DigitalOcean Control Panel or by sending a POST request to the /v2/gen-ai/models/api_keys endpoint. Then, send your prompts to models from OpenAI, Anthropic, Meta, or other providers using the serverless inference API endpoints.
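For example, the following minimal sketch creates a key through the API with Python's requests library. It authenticates with a DigitalOcean API token (not a model access key); the name field in the request body and the response shape are assumptions here, so check the API reference for the exact schema.
# Hedged sketch: create a model access key via the DigitalOcean API.
# DIGITALOCEAN_TOKEN is a personal access token; the "name" field is an assumption.
import os
import requests
resp = requests.post(
    "https://api.digitalocean.com/v2/gen-ai/models/api_keys",
    headers={"Authorization": f"Bearer {os.environ['DIGITALOCEAN_TOKEN']}"},
    json={"name": "serverless-inference-key"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the secret key is shown only once, so store it securely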
Serverless Inference API Endpoints
The serverless inference API is available at https://inference.do-ai.run and has the following endpoints:
| Endpoint | Verb | Description |
|---|---|---|
| /v1/models | GET | Returns a list of available models and their IDs. |
| /v1/chat/completions | POST | Sends chat-style prompts and returns model responses. |
| /v1/responses | POST | Sends chat-style prompts and returns text or multimodal model responses. |
| /v1/images/generations | POST | Generates images from text prompts. |
| /v1/async-invoke | POST | Sends text, image, or text-to-speech generation requests to fal models. |
We support both /v1/chat/completions and /v1/responses. Choose the endpoint that best fits your use case:
- Use /v1/chat/completions when building or maintaining chat-style integrations that rely on structured messages with roles such as system, user, and assistant, or when migrating existing chat-based code with minimal changes.
- Use /v1/responses when building new integrations or working with newer models that only support the Responses API. It’s also useful for multi-step tool use in a single request, preserving state across turns with store: true, and simplifying requests by using a single input field with improved caching efficiency.
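For a quick comparison, the following sketch sends the same question through both endpoints with the OpenAI Python SDK. The model IDs are examples; replace them with IDs returned by /v1/models.
# Sketch: the same prompt sent through both APIs; model IDs are examples.
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)
# Chat Completions API: structured messages with roles.
chat = client.chat.completions.create(
    model="llama3.3-70b-instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(chat.choices[0].message.content)
# Responses API: a single input field.
resp = client.responses.create(
    model="openai-gpt-oss-20b",
    input="What is the capital of France?",
)
print(resp.output[1].content[0].text)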
You can use these endpoints with cURL, the OpenAI Python SDK, or the Gradient SDK.
Retrieve Available Models
The following cURL, OpenAI Python SDK, and Gradient SDK examples show how to retrieve the available models.
Send a GET request to the /v1/models endpoint using your model access key. For example:
curl -X GET https://inference.do-ai.run/v1/models \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json"
This returns a list of available models with their corresponding model IDs (id):
...
{
"created": 1752255238,
"id": "alibaba-qwen3-32b",
"object": "model",
"owned_by": "digitalocean"
},
{
"created": 1737056613,
"id": "anthropic-claude-3.5-haiku",
"object": "model",
"owned_by": "anthropic"
},
...
from openai import OpenAI
from dotenv import load_dotenv
import os
load_dotenv()
client = OpenAI(
base_url="https://inference.do-ai.run/v1/",
api_key=os.getenv("MODEL_ACCESS_KEY"),
)
models = client.models.list()
for m in models.data:
print("-", m.id)
from gradient import Gradient
from dotenv import load_dotenv
import os
load_dotenv()
client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))
models = client.models.list()
print("Available models:")
for model in models.data:
print(f" - {model.id}")
Send Prompt to a Model Using the Chat Completions API
The following cURL, OpenAI Python SDK, and Gradient SDK examples show how to send a prompt to a model. Include your model access key and the following fields in your request:
- model: The model ID of the model you want to use. Get the model ID using /v1/models or on the available models page.
- messages: The input prompt or conversation history. Serverless inference does not have sessions, so include all relevant context in this field; see the sketch after this list.
- temperature: A value between 0.0 and 1.0 to control randomness and creativity.
- max_tokens: The maximum number of tokens to generate in the response. Use this to manage output length and cost.
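Because there are no server-side sessions, a follow-up question must resend the earlier turns in messages. The following minimal sketch uses the OpenAI Python SDK; the model ID is an example.
# Sketch: carry conversation history by resending prior turns on every request.
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)
history = [{"role": "user", "content": "What is the capital of France?"}]
first = client.chat.completions.create(model="llama3.3-70b-instruct", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})
# The follow-up only makes sense because the earlier turns are resent.
history.append({"role": "user", "content": "What is its population?"})
second = client.chat.completions.create(model="llama3.3-70b-instruct", messages=history)
print(second.choices[0].message.content)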
Send a POST request to the /v1/chat/completions endpoint using your model access key.
The following example request sends a prompt to a Llama 3.3 Instruct-70B model with the prompt What is the capital of France?, a temperature of 0.7, and maximum number of tokens set to 50.
curl -X POST https://inference.do-ai.run/v1/chat/completions \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3-70b-instruct",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"temperature": 0.7,
"max_tokens": 50
}'
The response includes the generated text and token usage details:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"audio": null,
"content": "The capital of France is Paris.",
"refusal": null,
"role": ""
}
}
],
"created": 1747247763,
"id": "",
"model": "llama3.3-70b-instruct",
"object": "chat.completion",
"service_tier": null,
"usage": {
"completion_tokens": 8,
"prompt_tokens": 43,
"total_tokens": 51
}
}
from openai import OpenAI
from dotenv import load_dotenv
import os
load_dotenv()
client = OpenAI(
base_url="https://inference.do-ai.run/v1/",
api_key=os.getenv("MODEL_ACCESS_KEY"),
)
resp = client.chat.completions.create(
model="llama3-8b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a fun fact about octopuses."}
],
)
print(resp.choices[0].message.content)
from gradient import Gradient
from dotenv import load_dotenv
import os
load_dotenv()
client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))
resp = client.chat.completions.create(
model="llama3-8b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a fun fact about octopuses."}
],
)
print(resp.choices[0].message.content)
To use prompt caching, specify the cache_control parameter with type: ephemeral and a ttl in your JSON request body. The ttl value can be 5m (default) or 1h. The following examples show how to use the cache_control parameter in user, developer, and tool messages.
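For context, the following minimal sketch shows where a cached content block sits in a full /v1/chat/completions request. It posts the JSON body directly with Python's requests library; the model ID is an example and caching support can vary by model. The role-specific JSON snippets after it show the same pattern for user, developer, and tool messages.
# Sketch: a chat completion request with a cached message segment.
import os
import requests
body = {
    "model": "anthropic-claude-3.5-haiku",  # example model ID
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "A long reference document to reuse across requests...",
                    "cache_control": {"type": "ephemeral", "ttl": "5m"},
                },
                {"type": "text", "text": "Summarize the document above."},
            ],
        }
    ],
    "max_tokens": 200,
}
resp = requests.post(
    "https://inference.do-ai.run/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"},
    json=body,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["usage"])  # shows cache_created_input_tokens and cache_read_input_tokens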
{
"role": "user",
"content": {
"type": "text",
"text": "This is cached for 1h.",
"cache_control": {
"type": "ephemeral",
"ttl": "1h"
}
}
}
{
"role": "developer",
"content": [
{
"type": "text",
"text": "Cache this segment for 5 minutes.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
},
{
"type": "text",
"text": "Do not cache this segment"
}
]
}
{
"role": "tool",
"tool_call_id": "tool_call_id",
"content": [
{
"type": "text",
"text": "Tool output cached for 5m.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
}
]
}
The JSON response looks similar to the following and shows the number of input tokens cached during this request:
"usage": {
"cache_created_input_tokens": 1043,
"cache_creation": {
"ephemeral_1h_input_tokens": 0,
"ephemeral_5m_input_tokens": 1043
},
"cache_read_input_tokens": 0,
"completion_tokens": 100,
"prompt_tokens": 14,
"total_tokens": 114
}
If you send the request again, the cached input tokens are used and the response looks like this:
"usage": {
"cache_created_input_tokens": 0,
"cache_creation": {
"ephemeral_1h_input_tokens": 0,
"ephemeral_5m_input_tokens": 0
},
"cache_read_input_tokens": 1043,
"completion_tokens": 100,
"prompt_tokens": 14,
"total_tokens": 114
}
Send Prompt to a Model Using the Responses API
The following cURL, OpenAI Python SDK, and Gradient SDK examples show how to send a prompt using the /v1/responses endpoint. Include your model access key and the following fields in your request:
- model: The model ID of the model you want to use. Get the model ID using /v1/models or on the available models page.
- input: The prompt or input content you want the model to respond to.
- max_output_tokens: The maximum number of tokens to generate in the response.
- temperature: A value between 0.0 and 1.0 to control randomness and creativity.
- stream: Set to true to stream partial responses; see the streaming sketch at the end of this section.
Send a POST request to the /v1/responses endpoint using your model access key.
The following example request sends a prompt to an OpenAI GPT-OSS-20B model with the prompt What is the capital of France?, a temperature of 0.7, and maximum number of output tokens set to 50.
curl -sS -X POST https://inference.do-ai.run/v1/responses \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "openai-gpt-oss-20b",
"input": "What is the capital of France?",
"max_output_tokens": 50,
"temperature": 0.7,
"stream": false
}'
The response includes structured output and token usage details:
{
...
"output": [
{
"content": [
{
"text": "We need to answer: The capital of France is Paris. This is straightforward.",
"type": "reasoning_text"
}
],
...
},
{
"content": [
{
"text": "The capital of France is **Paris**.",
"type": "output_text"
}
],
...
}
],
...
"usage": {
"input_tokens": 72,
"input_tokens_details": {
"cached_tokens": 32
},
"output_tokens": 35,
"output_tokens_details": {
"reasoning_tokens": 17,
"tool_output_tokens": 0
},
"total_tokens": 107
},
...
}
from openai import OpenAI
from dotenv import load_dotenv
import os
load_dotenv()
client = OpenAI(
base_url="https://inference.do-ai.run/v1/",
api_key=os.getenv("MODEL_ACCESS_KEY"),
)
resp = client.responses.create(
model="openai-gpt-oss-20b",
input="What is the capital of France?",
max_output_tokens=50,
temperature=0.7,
)
print(resp.output[1].content[0].text)
from gradient import Gradient
from dotenv import load_dotenv
import os
load_dotenv()
client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))
resp = client.responses.create(
model="openai-gpt-oss-20b",
input="What is the capital of France?",
max_output_tokens=50,
temperature=0.7,
)
print(resp.output[1].content[0].text)
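To stream partial output instead of waiting for the full response, set stream to true. The following minimal sketch uses the OpenAI Python SDK; the streaming event types follow the OpenAI Responses interface and are an assumption here, so adjust them to what the endpoint actually emits.
# Sketch: stream a response and print text deltas as they arrive.
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://inference.do-ai.run/v1/",
    api_key=os.getenv("MODEL_ACCESS_KEY"),
)
stream = client.responses.create(
    model="openai-gpt-oss-20b",
    input="Write a haiku about the ocean.",
    stream=True,
)
for event in stream:
    # Event type name assumed from the OpenAI Responses streaming interface.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()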
Generate Image
The following cURL, OpenAI Python SDK, and Gradient SDK examples show how to generate an image from a text prompt. Include your model access key and the following fields in your request:
- model: The model ID of the image generation model you want to use. Get the model ID using /v1/models or on the available models page.
- prompt: The text prompt to generate the image from.
- n: The number of images to generate. Must be between 1 and 10.
- size: The desired dimensions of the generated image. Supported values are 256x256, 512x512, and 1024x1024.
Make sure to always specify n and size when generating images.
Send a POST request to the /v1/images/generations endpoint using your model access key.
The following example request sends a prompt to the openai-gpt-image-1 model to generate an image of a baby sea otter floating on its back in calm blue water, with an image size of 1024x1024:
curl -X POST https://inference.do-ai.run/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-d '{
"model": "openai-gpt-image-1",
"prompt": "A cute baby sea otter floating on its back in calm blue water",
"n": 1,
"size": "1024x1024"
}'
The response includes a JSON object with a Base64 image string and other details such as image format and tokens used:
{
"background": "opaque",
"created": 1770659857,
"data": [
{
"b64_json": "iVBORw0KGgoAAAANSUhEU...
}
],
"output_format": "png",
"quality": "medium",
"size": "1024x1024",
"usage": {
"input_tokens": 20,
"input_tokens_details": {
"text_tokens": 20
},
"output_tokens": 1056,
"total_tokens": 1076
}
}
If you want to save the image as a file, pipe the image string to a file using jq and base64:
curl -X POST https://inference.do-ai.run/v1/images/generations \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-d '{
"model": "openai-gpt-image-1",
"prompt": "A cute baby sea otter floating on its back in calm blue water",
"n": 1,
"size": "1024x1024"
}' | jq -r '.data[0].b64_json' | base64 --decode > sea_otter.png
An image named sea_otter.png is created in your current directory after a few seconds.
from openai import OpenAI
from dotenv import load_dotenv
import os, base64
load_dotenv()
client = OpenAI(
base_url="https://inference.do-ai.run/v1/",
api_key=os.getenv("MODEL_ACCESS_KEY"),
)
result = client.images.generate(
model="openai-gpt-image-1",
prompt="A cute baby sea otter, children’s book drawing style",
size="1024x1024",
n=1
)
b64 = result.data[0].b64_json
with open("sea_otter.png", "wb") as f:
f.write(base64.b64decode(b64))
print("Saved sea_otter.png")
from gradient import Gradient
from dotenv import load_dotenv
import os, base64
load_dotenv()
client = Gradient(model_access_key=os.getenv("MODEL_ACCESS_KEY"))
result = client.images.generations.create(
model="openai-gpt-image-1",
prompt="A cute baby sea otter, children’s book drawing style",
size="1024x1024",
n=1
)
b64 = result.data[0].b64_json
with open("sea_otter.png", "wb") as f:
f.write(base64.b64decode(b64))
print("Saved sea_otter.png")
Generate Image, Audio, or Text-to-Speech Using fal Models
The following examples show how to generate an image, an audio clip, or text-to-speech output with fal models using the /v1/async-invoke endpoint.
The following example sends a request to generate an image using the fal-ai/flux/schnell model.
curl -X POST 'https://inference.do-ai.run/v1/async-invoke' \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model_id": "fal-ai/flux/schnell",
"input": { "prompt": "A futuristic city at sunset" }
}'
You can update the image generation request to also include the output format, number of inference steps, guidance scale, number of images to generate, and safety checker option, as in the following fal-ai/fast-sdxl example:
curl -X POST https://inference.do-ai.run/v1/async-invoke \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model_id": "fal-ai/fast-sdxl",
"input": {
"prompt": "A futuristic cityscape at sunset, with flying cars and towering skyscrapers.",
"output_format": "landscape_4_3",
"num_inference_steps": 4,
"guidance_scale": 3.5,
"num_images": 1,
"enable_safety_checker": true
},
"tags": [
{"key": "type", "value": "test"}
]
}'
The following example sends a request to generate a 60-second audio clip using the fal-ai/stable-audio-25/text-to-audio model:
curl -X POST 'https://inference.do-ai.run/v1/async-invoke' \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model_id": "fal-ai/stable-audio-25/text-to-audio",
"input": {
"prompt": "Techno song with futuristic sounds",
"seconds_total": 60
},
"tags": [
{ "key": "type", "value": "test" }
]
}'
The following example sends a request to generate text-to-speech audio using the fal-ai/elevenlabs/tts/multilingual-v2 model:
curl -X POST 'https://inference.do-ai.run/v1/async-invoke' \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model_id": "fal-ai/elevenlabs/tts/multilingual-v2",
"input": {
"text": "This text-to-speech example uses DigitalOcean multilingual voice."
},
"tags": [
{ "key": "type", "value": "test" }
]
}'
When you send a request to the /v1/async-invoke endpoint, it starts an asynchronous job for the image, audio, or text-to-speech generation and returns a request_id. The job status is initially QUEUED and the response looks similar to the following:
{
"completed_at": null,
"created_at": "2026-01-22T19:19:19.112403432Z",
"error": null,
"model_id": "fal-ai/fast-sdxl",
"output": null,
"request_id": "6590784a-ce47-4556-9ff4-53baff2693fb",
"started_at": null,
"status": "QUEUED"Query the status endpoint frequently using the request_id to check the progress of the job:
curl -X GET "https://inference.do-ai.run/v1/async-invoke/<request_id>/status" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
When the job completes, the status updates to COMPLETED. You can then use the /v1/async-invoke/<request_id> endpoint to fetch the generated result:
curl -X GET "https://inference.do-ai.run/v1/async-invoke/<request_id>" \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
The response includes a URL to the generated image, audio, or text-to-speech file, which you can download or open directly in your browser or app:
{
...
"images": [
{
"content_type": "image/jpeg",
"height": 768,
"url": "https://v3b.fal.media/files/b/0a8b7281/HpQcEqkz-xy2ZI5do9Lyp.jpg",
"width": 1024
}
...
],
"request_id": "6f76e8f7-f6b4-4e20-ab9a-ca0f01a9d2f4",
"started_at": null,
"status": "COMPLETED"
}
Alternatively, you can call serverless inference from your automation workflows. The n8n community node connects to any DigitalOcean-hosted model using your model access key. You can self-host n8n using the n8n Marketplace app.
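If you prefer to script the asynchronous workflow directly, the following minimal sketch submits a fal image job, polls the status endpoint, and fetches the result with Python's requests library. The intermediate status values and the exact result shape are assumptions based on the examples above.
# Sketch: submit an async fal job, poll until it finishes, then fetch the result.
import os
import time
import requests
BASE = "https://inference.do-ai.run/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"}
# 1. Submit the job.
submit = requests.post(
    f"{BASE}/async-invoke",
    headers=HEADERS,
    json={"model_id": "fal-ai/flux/schnell", "input": {"prompt": "A futuristic city at sunset"}},
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]
# 2. Poll the status endpoint until the job leaves the queue (non-terminal states assumed).
while True:
    status = requests.get(f"{BASE}/async-invoke/{request_id}/status", headers=HEADERS, timeout=30)
    status.raise_for_status()
    if status.json()["status"] not in ("QUEUED", "IN_PROGRESS"):
        break
    time.sleep(2)
# 3. Fetch the completed result; the images field matches the example response above.
result = requests.get(f"{BASE}/async-invoke/{request_id}", headers=HEADERS, timeout=30)
result.raise_for_status()
print(result.json())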
Model Access Keys
You can create and manage model access keys in the Model Access Keys section of the Serverless inference page in the DigitalOcean Control Panel or using the API.
Create Keys
To create a model access key, click Create model access key to open the Add model access key window. In the Key name field, enter a name for your model access key, then click Add model access key.
Your new model access key with its creation date appears in the Model Access Keys section. The secret key is visible only once, immediately after creation, so copy and store it securely.
Model access keys are private and incur usage-based charges. Do not share them or expose them in front-end code. We recommend storing them using a secrets manager (for example, AWS Secrets Manager, HashiCorp Vault, or 1Password) or a secure environment variable in your deployment configuration.
Rename Keys
Renaming a model access key can help you organize and manage your keys more effectively, especially when using multiple keys for different projects or environments.
To rename a key, click … to the right of the key in the list to open the key’s menu, then click Rename. In the Rename model access key window that opens, in the Key name field, enter a new name for your key and then click UPDATE.
Regenerate Keys
Regenerating a model access key creates a new secret key and immediately and permanently invalidates the previous one. If a key has been compromised or you want to rotate keys for security purposes, regenerate the key, then update any applications that use the previous key to use the new one.
To regenerate a key, click … to the right of the key in the list to open the key’s menu, then click Regenerate. In the Regenerate model access key window that opens, enter the name of your key to confirm the action, then click Regenerate access key. Your new secret key is displayed in the Model Access Keys section.
Delete Keys
Deleting a model access key permanently and irreversibly destroys it. Any applications using a deleted key lose access to the API.
To delete a key, click … to the right of the key in the list to open the key’s menu, then click Delete. In the Delete model access key window that opens, type the name of the key to confirm the deletion, then click Delete access key.