How to Use Batch Inference on DigitalOcean AI

Validated on 27 Apr 2026 • Last edited on 27 Apr 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare model capabilities and pricing. You can also use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.

Batch inference lets you run large collections of LLM requests as a single asynchronous job and retrieve results when processing completes, typically within 24 hours.

Using batch inference, you can run large sets of text prompts asynchronously with OpenAI and Anthropic models. You use the same model access key and send requests to the same serverless inference base URL (https://inference.do-ai.run); DigitalOcean forwards compatible batch traffic to the model provider.

Note
Only text prompts for OpenAI and Anthropic commercial models are supported for batch inference.

Use Batch Inference with the API

Batch inference follows a three-step asynchronous pattern. First, prepare and upload your input file, then create the batch job. Next, poll the job for status. When the job completes, download the results.

Prepare Your Input File

You need to submit batch inputs as a JSONL file (JSON Lines), where each line represents one inference request. The file must be 200 MB or smaller and can contain no more than 50,000 requests. Each line follows the batch input schema for the provider you use.

For OpenAI models, each line looks like the following:

{"custom_id": "req-1", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1-mini", "input": "Summarize the following article: ...", "max_output_tokens": 256}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1-mini", "input": "Classify this review as positive or negative: ...", "temperature": 0.2}}

For Anthropic models, each line looks like the following:

{"custom_id": "req-1", "params": {"model": "claude-3-5-sonnet-latest", "messages": [{"role": "user", "content": [{"type": "text", "text": "Summarize this document: ..."}]}], "max_tokens": 256}}
{"custom_id": "req-2", "params": {"model": "claude-3-5-sonnet-latest", "messages": [{"role": "user", "content": [{"type": "text", "text": "Extract key entities from: ..."}]}], "max_tokens": 256, "temperature": 0.2}}
Note
Every custom_id must be unique within the file. Duplicate custom_id values cause validation to fail.
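A small Python sketch (not part of the platform SDK) that builds OpenAI-format input lines and guards against duplicate custom_id values before you write the file:

```python
import json

def build_batch_lines(prompts, model="gpt-4.1-mini"):
    """Build OpenAI-format batch input lines, enforcing unique custom_id values."""
    lines, seen = [], set()
    for i, prompt in enumerate(prompts, start=1):
        custom_id = f"req-{i}"
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        seen.add(custom_id)
        lines.append(json.dumps({
            "custom_id": custom_id,
            "method": "POST",
            "url": "/v1/responses",
            "body": {"model": model, "input": prompt, "max_output_tokens": 256},
        }))
    return lines

lines = build_batch_lines(["Summarize this article: ...", "Classify this review: ..."])
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")
```

Generating custom_id values from a counter like this makes duplicates impossible by construction, and the IDs map results back to your prompts when you parse the output file.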

Upload Your File

Before creating a batch job, upload your JSONL file to get a file_id. We use a two-step presigned upload to avoid routing large payloads through the API gateway. Send the following request with the name of your file to get the file_id and a presigned URL:

curl -X POST https://inference.do-ai.run/v1/batches/files \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_name": "<your-file-name>.jsonl"
  }'

The response looks similar to the following:

{
  "file_id": "file_abc123",
  "upload_url": "<presigned_spaces_url>",
  "expires_at": "2026-04-01T12:15:00Z"
}

Note the file_id and upload_url values. The file_id is valid for 29 to 30 days, and you can reuse it across multiple batch jobs. The presigned URL is valid for 10 to 15 minutes; upload your file before it expires.
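Because the presigned URL is short-lived, it can help to check expires_at before starting a large upload. A minimal sketch, assuming the timestamp is ISO 8601 as in the example response:

```python
from datetime import datetime, timezone

def is_expired(expires_at: str, now=None) -> bool:
    """Return True if the ISO-8601 expires_at timestamp has passed."""
    deadline = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now >= deadline

# Fixed "current" time, for illustration:
now = datetime(2026, 4, 1, 12, 20, tzinfo=timezone.utc)
print(is_expired("2026-04-01T12:15:00Z", now=now))  # True: already past expiry
```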

Next, upload the file directly to storage using the following request:

curl -X PUT "<presigned_spaces_url>" \
  -H "Content-Type: application/jsonl" \
  --data-binary @<your-file-name>.jsonl

If the upload fails, retry with the same presigned URL until it expires. If the URL has expired, request a new one from the /v1/batches/files endpoint.
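This retry-then-refresh flow can be sketched as follows, with the upload and URL-refresh steps passed in as callables so the logic is easy to test. The function signatures here are illustrative, not an SDK API; in practice upload_fn would wrap the PUT request above and refresh_url_fn would re-request a presigned URL:

```python
def upload_with_retry(upload_fn, refresh_url_fn, url, max_attempts=3):
    """Attempt an upload, refreshing the presigned URL if it has expired.

    upload_fn(url) returns True on success, False if the URL expired,
    and raises on a transient failure. (Illustrative contract.)
    """
    for attempt in range(max_attempts):
        try:
            if upload_fn(url):
                return url
        except Exception:
            continue  # transient failure: retry with the same URL
        url = refresh_url_fn()  # upload_fn returned False: URL expired, refresh
    raise RuntimeError("upload failed after retries")

# Demonstration with fakes: one transient failure, then an expired URL.
calls = {"n": 0}
def fake_upload(url):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("network blip")  # transient
    return url != "old-url"            # "old-url" is expired

final = upload_with_retry(fake_upload, lambda: "new-url", "old-url")
print(final)  # new-url
```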

The same upload flow using the Python SDK:

import os
import requests

# Request a presigned upload URL and file_id
upload_info = client.batches.files.create(
    file_name="eval_prompts.jsonl",
    file_size_in_bytes=os.path.getsize("eval_prompts.jsonl")
)

# Upload the file directly to storage
with open("eval_prompts.jsonl", "rb") as f:
    requests.put(upload_info.upload_url, data=f, headers={"Content-Type": "application/jsonl"})

file_id = upload_info.file_id

Create the Batch Job

Once your file is uploaded, create the batch job by using the file_id.

curl -X POST https://inference.do-ai.run/v1/batches \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "<your-file-id>",
    "completion_window": "24h",
    "parameters": {
      "temperature": 0.2,
      "max_tokens": 1024
    }
  }'

The response looks similar to the following:

{
  "id": "batch_xyz789",
  "status": "validating",
  "model": "gpt-4o-mini",
  "file_id": "file_abc123",
  "completion_window": "24h",
  "created_at": "2026-04-01T10:00:00Z",
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  }
}

Note the id value; this is the ID of the batch job.

batch = client.batches.create(
    file_id=file_id,
    model="gpt-4o-mini",
    completion_window="24h",
    parameters={
        "temperature": 0.2,
        "max_tokens": 1024
    },
    webhook={
        "url": "https://your-server.com/batch-webhook",
        "secret": "your-webhook-secret"
    }
)
print(f"Batch created: {batch.id}, status: {batch.status}")

Monitor Job Status

Poll the batch status endpoint to track job progress. The status is updated in near real-time.

curl https://inference.do-ai.run/v1/batches/<your-batch-job-id> \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

The response looks similar to the following:

{
  "id": "batch_xyz789",
  "status": "in_progress",
  "model": "gpt-4o-mini",
  "created_at": "2026-04-01T10:00:00Z",
  "request_counts": {
    "total": 50000,
    "completed": 23410,
    "failed": 12
  },
  "usage": {
    "input_tokens": 14200000,
    "output_tokens": 3100000,
    "cached_tokens": 800000
  }
}

The job status is shown in the status field.

import time

while True:
    batch = client.batches.retrieve("batch_xyz789")
    print(f"Status: {batch.status} | Completed: {batch.request_counts.completed}/{batch.request_counts.total}")

    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break

    time.sleep(60)  # Poll every 60 seconds

Jobs progress through the following states:

  • validating: The platform is checking the JSONL file structure, unique custom_id values, token counts, and other basic checks.

  • queued: Validation passed. The job is waiting for compute resources.

  • in_progress: A worker is actively executing inference requests.

  • completed: All requests have been processed. This state is reached even if some individual requests failed. Failed requests are logged in the error file.

  • failed: The entire job failed due to a systemic or unrecoverable error such as complete file validation failure.

  • cancelling: A cancellation was requested. In-flight requests may still complete.

  • cancelled: The job was cancelled. Results for completed requests are preserved and available.

  • expired: The job exceeded the 24-hour completion window. Results for any completed requests are preserved and available.
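The terminal and results-preserving states above can be captured in a small helper, which is useful when writing polling loops. The state sets below are taken directly from the list above:

```python
# State sets from the list above.
TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}
RESULTS_PRESERVED = {"completed", "cancelled", "expired"}

def is_terminal(status: str) -> bool:
    """True once the job can no longer make progress."""
    return status in TERMINAL_STATES

def has_results(status: str) -> bool:
    """True if results for completed requests are preserved and downloadable."""
    return status in RESULTS_PRESERVED

print(is_terminal("in_progress"))  # False
print(has_results("cancelled"))    # True
```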

Download Results

The GET /v1/batches/{batch_id} endpoint returns result_available: true when results are ready. Once the job reaches a terminal state (completed, cancelled, or expired), retrieve your results.

Warning
Output files are available for download for 14–30 days after job completion. After this window, results are permanently deleted and cannot be recovered.

curl https://inference.do-ai.run/v1/batches/<your-batch-job-id>/results \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

The response looks like the following:

{
  "output_url": "<presigned_download_url>",
  "error_url": "<presigned_download_url>",
  "expires_at": "2026-04-01T11:00:00Z"
}

Each call to this endpoint returns a new presigned URL. Download the files before the URLs expire.

The output file is a JSONL file where each line corresponds to one request from your input. Each line includes the following fields and values:

{"custom_id": "req-1", "response": {"id": "chatcmpl-...", "choices": [{"message": {"role": "assistant", "content": "Summary: ..."}}], "usage": {"prompt_tokens": 312, "completion_tokens": 89}}, "error": null}
{"custom_id": "req-2", "response": null, "error": {"code": "context_length_exceeded", "message": "Request exceeded maximum context length."}}

Requests that failed are written to a separate error file that has the following fields and values:

{"custom_id": "req-2", "error": {"code": "context_length_exceeded", "message": "Request exceeded maximum context length."}}
{"custom_id": "req-47", "error": {"code": "content_policy_violation", "message": "Request was blocked by content moderation."}}

import json
import requests

result_info = client.batches.results("batch_xyz789")

# Download output file
output = requests.get(result_info.output_url)
with open("batch_output.jsonl", "wb") as f:
    f.write(output.content)

# Parse results
with open("batch_output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record["error"] is None:
            print(record["custom_id"], record["response"]["choices"][0]["message"]["content"])
        else:
            print(f"Failed: {record['custom_id']} - {record['error']['message']}")
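A common follow-up is resubmitting only the failed requests as a new batch job. A sketch, assuming you still have the original input lines, that filters them by the custom_id values found in the error file:

```python
import json

def build_retry_lines(original_lines, error_lines):
    """Keep only the original input lines whose custom_id appears in the error file."""
    failed_ids = {json.loads(line)["custom_id"] for line in error_lines if line.strip()}
    return [line for line in original_lines if json.loads(line)["custom_id"] in failed_ids]

original = [
    '{"custom_id": "req-1", "method": "POST", "url": "/v1/responses", "body": {}}',
    '{"custom_id": "req-2", "method": "POST", "url": "/v1/responses", "body": {}}',
]
errors = ['{"custom_id": "req-2", "error": {"code": "context_length_exceeded"}}']
retry = build_retry_lines(original, errors)
print(len(retry))  # 1: only req-2 is resubmitted
```

Write the returned lines to a new JSONL file and upload it as a fresh batch job. Note that requests which failed with context_length_exceeded need shortening, not just resubmitting.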

Cancel a Batch Job

You can cancel a batch job at any time before it reaches a terminal state. Results for requests that were already completed before cancellation are preserved and billed. Incomplete requests are not billed.

curl -X POST https://inference.do-ai.run/v1/batches/<your-batch-job-id>/cancel \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

The response looks like the following:

{
  "id": "batch_xyz789",
  "status": "cancelling"
}

The job transitions to a cancelling status immediately. Because both OpenAI and Anthropic process cancellations asynchronously, the job remains in cancelling until the provider confirms the final state, at which point it transitions to cancelled. Continue polling until you see the cancelled status.

client.batches.cancel("batch_xyz789")

# Poll until terminal
while True:
    batch = client.batches.retrieve("batch_xyz789")
    if batch.status in ("cancelled", "completed", "failed", "expired"):
        break
    time.sleep(30)

print(f"Final status: {batch.status}")
print(f"Completed requests: {batch.request_counts.completed}")

List Batch Jobs

To retrieve a list of all batch jobs, use one of the following:

curl https://inference.do-ai.run/v1/batches \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

batches = client.batches.list()
for batch in batches:
    print(batch.id, batch.status, batch.created_at)
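When you have many jobs, a small client-side helper can summarize the returned list by status. This sketch assumes each job exposes a status field, as shown in the earlier responses:

```python
from collections import Counter

def status_summary(batches):
    """Tally jobs by status for a quick overview."""
    return Counter(b["status"] for b in batches)

jobs = [
    {"id": "batch_1", "status": "completed"},
    {"id": "batch_2", "status": "in_progress"},
    {"id": "batch_3", "status": "completed"},
]
print(status_summary(jobs))  # Counter({'completed': 2, 'in_progress': 1})
```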

Use Webhooks

To receive a notification when the job reaches a terminal state (instead of polling), you can configure a webhook URL when creating a batch job. The request must include url and secret for the webhook in the JSON body:

curl -X POST https://inference.do-ai.run/v1/batches \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "<your-file-id>",
    "completion_window": "24h",
    "parameters": {
      "temperature": 0.2,
      "max_tokens": 1024
    },
    "webhook": {
      "url": "https://your-server.com/batch-webhook",
      "secret": "your-webhook-secret"
    }
  }'

If delivery fails, we retry the webhook up to 3 times with exponential backoff. Polling remains available as a fallback; for mission-critical workflows, use polling rather than relying on webhooks alone.
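On the receiving side, you should use the shared secret to verify that a delivery actually came from the platform. The signing scheme is not specified here, so the following is an illustrative check assuming an HMAC-SHA256 signature over the raw request body; verify the actual header name and scheme against the platform's webhook documentation:

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_hex: str) -> bool:
    """Constant-time HMAC-SHA256 check over the raw request body.

    The signing scheme and the header carrying signature_hex are
    assumptions, not documented platform behavior.
    """
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

body = b'{"id": "batch_xyz789", "status": "completed"}'
sig = hmac.new(b"your-webhook-secret", body, hashlib.sha256).hexdigest()
print(verify_signature("your-webhook-secret", body, sig))  # True
```

Using hmac.compare_digest instead of == avoids leaking signature information through timing differences.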

Full Python Example

The following is an end-to-end example showing how to create and run a batch inference job:

import json
import os
import time
import requests
from gradient_ai import GradientAI

client = GradientAI(api_key=os.environ["GRADIENT_API_KEY"])

# Step 1: Write input file
my_prompts = ["Summarize: ...", "Classify: ..."]  # replace with your prompts
with open("prompts.jsonl", "w") as f:
    for i, prompt in enumerate(my_prompts):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256
            }
        }) + "\n")

# Step 2: Upload file
upload_info = client.batches.files.create(
    file_name="prompts.jsonl",
    file_size_in_bytes=os.path.getsize("prompts.jsonl")
)
with open("prompts.jsonl", "rb") as f:
    requests.put(upload_info.upload_url, data=f, headers={"Content-Type": "application/jsonl"})

# Step 3: Create batch job
batch = client.batches.create(
    file_id=upload_info.file_id,
    model="gpt-4o-mini",
    completion_window="24h",
    parameters={"temperature": 0.0, "max_tokens": 256}
)
print(f"Batch submitted: {batch.id}")

# Step 4: Poll for completion
while True:
    batch = client.batches.retrieve(batch.id)
    print(f"[{batch.status}] {batch.request_counts.completed}/{batch.request_counts.total} complete")
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

# Step 5: Download and parse results
if batch.status == "completed":
    result_info = client.batches.results(batch.id)
    output = requests.get(result_info.output_url)
    results = {}
    for line in output.text.strip().split("\n"):
        record = json.loads(line)
        if record["error"] is None:
            results[record["custom_id"]] = record["response"]["choices"][0]["message"]["content"]
    print(f"Downloaded {len(results)} results.")
