How to Use Batch Inference on DigitalOcean AI
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models (both DigitalOcean-hosted and third-party commercial models), compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.
Batch inference lets you run large collections of LLM requests as a single asynchronous job and retrieve results when processing completes, typically within 24 hours.
With batch inference, you run large sets of text prompts asynchronously against OpenAI and Anthropic models. You use the same model access key and the same serverless inference base URL (https://inference.do-ai.run); DigitalOcean forwards compatible batch traffic to the model provider.
Use Batch Inference Using the API
Batch inference follows a three-step asynchronous pattern: first, prepare and upload your input file and create the batch job; then, poll the job's status; finally, when the job completes, download the results.
Prepare Your Input File
Submit batch inputs as a JSONL (JSON Lines) file, where each line represents one inference request. The file must be 200 MB or smaller and can contain no more than 50,000 requests. Each line follows the batch input schema for the provider you use:
{"custom_id": "req-1", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1-mini", "input": "Summarize the following article: ...", "max_output_tokens": 256}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1-mini", "input": "Classify this review as positive or negative: ...", "temperature": 0.2}}{"custom_id": "req-1", "params": {"model": "claude-3-5-sonnet-latest", "messages": [{"role": "user", "content": [{"type": "text", "text": "Summarize this document: ..."}]}], "max_tokens": 256}}
{"custom_id": "req-2", "params": {"model": "claude-3-5-sonnet-latest", "messages": [{"role": "user", "content": [{"type": "text", "text": "Extract key entities from: ..."}]}], "max_tokens": 256, "temperature": 0.2}}custom_id must be unique within the file. Duplicate custom_id values cause validation to fail.
Upload Your File
Before creating a batch job, upload your JSONL file to get a file_id. We use a two-step presigned upload to avoid routing large payloads through the API gateway. Send the following request with the name of your file to get the file_id and a presigned URL:
curl -X POST https://inference.do-ai.run/v1/batches/files \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"file_name": "<your-file-name>.jsonl"
}'
The response looks similar to the following:
{
"file_id": "file_abc123",
"upload_url": "<presigned_spaces_url>",
"expires_at": "2026-04-01T12:15:00Z"
}
Note the file_id and upload_url values. The file_id is valid for 29-30 days, and you can reuse it across multiple batch jobs. The presigned URL is valid for 10–15 minutes; upload your file before it expires.
Next, upload the file directly to storage using the following request:
curl -X PUT "<presigned_spaces_url>" \
-H "Content-Type: application/jsonl" \
--data-binary @<your-file-name>.jsonl
If the upload fails, retry with the same presigned URL until it expires. If it expires, get a new one by re-requesting the presigned URL (see the retry sketch after the upload example below).
import os

import requests

# Assumes client is an initialized inference SDK client authenticated with your model access key.

# Request a presigned upload URL for the input file
upload_info = client.batches.files.create(
    file_name="eval_prompts.jsonl",
    file_size_in_bytes=os.path.getsize("eval_prompts.jsonl")
)

# Upload the file directly to storage using the presigned URL
with open("eval_prompts.jsonl", "rb") as f:
    requests.put(
        upload_info.upload_url,
        data=f,
        headers={"Content-Type": "application/jsonl"},
    )

file_id = upload_info.file_id
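If you script retries, a sketch like the following reuses the same presigned URL until it stops working and then requests a fresh one. It reuses the client.batches.files.create call shown above; the 403 status check for an expired presigned URL is an assumption, so adjust it to the errors your uploads actually return:
import os
import time

import requests

def upload_with_retry(client, path, max_attempts=3):
    # Request an initial presigned URL
    upload_info = client.batches.files.create(
        file_name=path,
        file_size_in_bytes=os.path.getsize(path),
    )
    for attempt in range(max_attempts):
        with open(path, "rb") as f:
            resp = requests.put(
                upload_info.upload_url,
                data=f,
                headers={"Content-Type": "application/jsonl"},
            )
        if resp.ok:
            return upload_info.file_id
        if resp.status_code == 403:
            # Presigned URL likely expired; request a new one (assumed behavior)
            upload_info = client.batches.files.create(
                file_name=path,
                file_size_in_bytes=os.path.getsize(path),
            )
        time.sleep(2 ** attempt)  # simple backoff between attempts
    raise RuntimeError(f"Upload of {path} failed after {max_attempts} attempts")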
Create the Batch Job
Once your file is uploaded, create the batch job using the file_id.
curl -X POST https://inference.do-ai.run/v1/batches \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"file_id": "<your-file-id>",
"completion_window": "24h",
"parameters": {
"temperature": 0.2,
"max_tokens": 1024
}
}'
The response looks similar to the following:
{
"id": "batch_xyz789",
"status": "validating",
"model": "gpt-4o-mini",
"file_id": "file_abc123",
"completion_window": "24h",
"created_at": "2026-04-01T10:00:00Z",
"request_counts": {
"total": 0,
"completed": 0,
"failed": 0
}
}
Note the id value; this is the batch job ID.
batch = client.batches.create(
    file_id=file_id,
    model="gpt-4o-mini",
    completion_window="24h",
    parameters={
        "temperature": 0.2,
        "max_tokens": 1024
    },
    # Optional: receive a notification when the job finishes (see Use Webhooks below)
    webhook={
        "url": "https://your-server.com/batch-webhook",
        "secret": "your-webhook-secret"
    }
)
print(f"Batch created: {batch.id}, status: {batch.status}")
Monitor Job Status
Poll the batch status endpoint to track job progress. The status is updated in near real-time.
curl https://inference.do-ai.run/v1/batches/<your-batch-job-id> \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
The following response is returned:
{
"id": "batch_xyz789",
"status": "in_progress",
"model": "gpt-4o-mini",
"created_at": "2026-04-01T10:00:00Z",
"request_counts": {
"total": 50000,
"completed": 23410,
"failed": 12
},
"usage": {
"input_tokens": 14200000,
"output_tokens": 3100000,
"cached_tokens": 800000
}
}
The job status is shown in the status field.
import time

while True:
    batch = client.batches.retrieve("batch_xyz789")
    print(f"Status: {batch.status} | Completed: {batch.request_counts.completed}/{batch.request_counts.total}")
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # Poll every 60 seconds
Jobs progress through the following states:
- validating: The platform is checking the JSONL file structure, unique custom_id values, token counts, and other basic constraints.
- queued: Validation passed. The job is waiting for compute resources.
- in_progress: A worker is actively executing inference requests.
- completed: All requests have been processed. This state is reached even if some individual requests failed; failed requests are logged in the error file.
- failed: The entire job failed due to a systemic or unrecoverable error, such as complete file validation failure.
- cancelling: A cancellation was requested. In-flight requests may still complete.
- cancelled: The job was cancelled. Results for completed requests are preserved and available.
- expired: The job exceeded the 24-hour completion window. Results for any completed requests are preserved and available.
Download Results
Once the job reaches a terminal state (completed, cancelled, or expired), retrieve your results. The GET /v1/batches/{batch_id} status response includes result_available: true when results are ready to download.
curl https://inference.do-ai.run/v1/batches/<your-batch-job-id>/results \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
The response looks like the following:
{
"output_url": "<presigned_download_url>",
"error_url": "<presigned_download_url>",
"expires_at": "2026-04-01T11:00:00Z"
}
Each call to this endpoint returns a new presigned URL. Download the files before the URLs expire.
The output file is a JSONL file where each line corresponds to one request from your input. Each line includes the following fields and values:
{"custom_id": "req-1", "response": {"id": "chatcmpl-...", "choices": [{"message": {"role": "assistant", "content": "Summary: ..."}}], "usage": {"prompt_tokens": 312, "completion_tokens": 89}}, "error": null}
{"custom_id": "req-2", "response": null, "error": {"code": "context_length_exceeded", "message": "Request exceeded maximum context length."}}Requests that failed are written to a separate error file that has the following fields and values:
{"custom_id": "req-2", "error": {"code": "context_length_exceeded", "message": "Request exceeded maximum context length."}}
{"custom_id": "req-47", "error": {"code": "content_policy_violation", "message": "Request was blocked by content moderation."}}import requests
import json

import requests

result_info = client.batches.results("batch_xyz789")

# Download the output file
output = requests.get(result_info.output_url)
with open("batch_output.jsonl", "wb") as f:
    f.write(output.content)

# Parse the results line by line
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record["error"] is None:
            print(record["custom_id"], record["response"]["choices"][0]["message"]["content"])
        else:
            print(f"Failed: {record['custom_id']} - {record['error']['message']}")
Cancel a Batch Job
You can cancel a batch job at any time before it reaches a terminal state. Results for requests that completed before cancellation are preserved and billed; incomplete requests are not billed.
curl -X POST https://inference.do-ai.run/v1/batches/<your-batch-job-id>/cancel \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
The response looks like the following:
{
"id": "batch_xyz789",
"status": "cancelling"
}
The job transitions to the cancelling status immediately. Because both OpenAI and Anthropic process cancellations asynchronously, the job remains in cancelling until the provider confirms the final state, at which point it transitions to cancelled. Continue polling until you see the cancelled status.
client.batches.cancel("batch_xyz789")

# Poll until the job reaches a terminal state
while True:
    batch = client.batches.retrieve("batch_xyz789")
    if batch.status in ("cancelled", "completed", "failed", "expired"):
        break
    time.sleep(30)

print(f"Final status: {batch.status}")
print(f"Completed requests: {batch.request_counts.completed}")
List Batch Jobs
To retrieve a list of all batch jobs, use one of the following:
curl https://inference.do-ai.run/v1/batches \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"
batches = client.batches.list()
for batch in batches:
    print(batch.id, batch.status, batch.created_at)
Use Webhooks
To receive a notification when the job reaches a terminal state (instead of polling), you can configure a webhook when creating a batch job. The request must include the webhook's url and secret in the JSON body:
curl -X POST https://inference.do-ai.run/v1/batches \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"file_id": "<your-file-id>",
"completion_window": "24h",
"parameters": {
"temperature": 0.2,
"max_tokens": 1024
},
"webhook": {
"url": "https://your-server.com/batch-webhook",
"secret": "your-webhook-secret"
}
}'
If delivery fails, we retry the webhook up to 3 times with exponential backoff. Polling remains available as a fallback; use polling for mission-critical workflows.
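On your server, the webhook handler only needs to acknowledge the delivery and trigger result retrieval. The delivery payload schema and the header that carries the secret are not documented here, so the following Flask sketch makes assumptions about both; adjust it to match the deliveries you actually receive:
import hmac

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = "your-webhook-secret"  # the secret configured on the batch job

@app.route("/batch-webhook", methods=["POST"])
def batch_webhook():
    # Header name is an assumption; check the headers on a real delivery.
    provided = request.headers.get("X-Webhook-Secret", "")
    if not hmac.compare_digest(provided, WEBHOOK_SECRET):
        abort(401)
    payload = request.get_json(force=True)
    # Field names are assumptions based on the batch status responses above.
    batch_id = payload.get("id")
    status = payload.get("status")
    print(f"Batch {batch_id} reached terminal state: {status}")
    # Fetch results here, for example with client.batches.results(batch_id)
    return "", 200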
Full Python Example
The following is an end-to-end example showing how to create and run a batch inference job:
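This sketch stitches together the snippets above. The SDK client construction is not covered in this guide, so it is left as a placeholder; fill it in with the client you use for serverless inference.
import json
import os
import time

import requests

# Initialize your inference SDK client with your model access key
# (construction not shown in this guide).
client = ...

INPUT_FILE = "eval_prompts.jsonl"

# 1. Request a presigned upload URL and upload the input file.
upload_info = client.batches.files.create(
    file_name=INPUT_FILE,
    file_size_in_bytes=os.path.getsize(INPUT_FILE),
)
with open(INPUT_FILE, "rb") as f:
    requests.put(
        upload_info.upload_url,
        data=f,
        headers={"Content-Type": "application/jsonl"},
    )

# 2. Create the batch job.
batch = client.batches.create(
    file_id=upload_info.file_id,
    model="gpt-4o-mini",
    completion_window="24h",
    parameters={"temperature": 0.2, "max_tokens": 1024},
)
print(f"Batch created: {batch.id}")

# 3. Poll until the job reaches a terminal state.
while True:
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status}")
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# 4. Download and parse the results.
if batch.status == "completed":
    result_info = client.batches.results(batch.id)
    output = requests.get(result_info.output_url)
    with open("batch_output.jsonl", "wb") as f:
        f.write(output.content)
    with open("batch_output.jsonl") as f:
        for line in f:
            record = json.loads(line)
            if record["error"] is None:
                print(record["custom_id"], "ok")
            else:
                print(record["custom_id"], "failed:", record["error"]["message"])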