
Inference APIs



The DigitalOcean Gradient™ AI Agentic Cloud provides inference capabilities through two API variants:

Serverless Inference: Access serverless inference models using an inference key at https://inference.do-ai.run.
Agent Inference: Integrate multi-agent workflows into your AI applications using a customer-specific agent endpoint.

These APIs are independent of the main DigitalOcean control-plane API (https://api.digitalocean.com).

Base URLs

API	Base URL
Serverless Inference	`https://inference.do-ai.run`
Agent Inference	`https://{your-agent-url}.agents.do-ai.run`

Requests

Any tool that is fluent in HTTP can communicate with the API simply by requesting the correct URI. Requests should be made using the HTTPS protocol so that traffic is encrypted. The interface responds to different methods depending on the action required.

Method	Usage
GET	For simple retrieval of information about your account, Droplets, or environment, you should use the GET method. The information you request will be returned to you as a JSON object. The attributes defined by the JSON object can be used to form additional requests. Any request using the GET method is read-only and will not affect any of the objects you are querying.
DELETE	To destroy a resource and remove it from your account and environment, use the DELETE method. This removes the specified object if it is found. If it is not found, the operation returns a response indicating that the object was not found. This idempotency means that you do not have to check for a resource's availability before issuing a delete command; the final state is the same regardless of whether the resource existed.
PUT	To update the information about a resource in your account, use the PUT method. Like the DELETE method, the PUT method is idempotent. It sets the state of the target using the provided values, regardless of their current values. Requests using the PUT method do not need to check the current attributes of the object.
PATCH	Some resources support partial modification. In these cases, the PATCH method is available. Unlike PUT, which generally requires a complete representation of a resource, a PATCH request is a set of instructions on how to modify a resource, updating only specific attributes.
POST	To create a new object, your request should specify the POST method. The POST request includes all of the attributes necessary to create a new object. When you wish to create a new object, send a POST request to the target endpoint.
HEAD	Finally, to retrieve metadata, use the HEAD method to get the headers. This returns only the headers that would be returned with an associated GET request. Response headers contain useful information about your API access and the results that are available for your request. For instance, the headers contain your current rate-limit value and the amount of time available until the limit resets. They also contain metrics about the total number of objects found, pagination information, and the total content length.
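
As a rough illustration of the POST case, the sketch below builds (but does not send) a request against the Serverless Inference base URL using only Python's standard library. The `/v1/chat/completions` path, the `Authorization: Bearer` scheme, the shape of the request payload, and the `build_chat_request` helper are all assumptions made for illustration; this page does not specify them, so check them against the endpoint reference before relying on them.

```python
import json
import urllib.request

# Serverless Inference base URL, as listed under "Base URLs".
BASE_URL = "https://inference.do-ai.run"


def build_chat_request(inference_key, model, prompt, stream=False):
    """Build (without sending) a POST request for a chat completion.

    Assumptions: the endpoint path, the Bearer auth scheme, and the
    payload fields are hypothetical and not confirmed by this page.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=BASE_URL + "/v1/chat/completions",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + inference_key,  # assumed scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it over HTTPS. Keeping construction separate from sending makes the request easy to inspect before any traffic leaves the machine.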

HTTP Statuses

Along with the HTTP methods that the API responds to, it will also return standard HTTP statuses, including error codes.

In the event of a problem, the status will contain the error code, while the body of the response will usually contain additional information about the problem that was encountered.

In general, if the status returned is in the 200 range, it indicates that the request was fulfilled successfully and that no error was encountered.

Return codes in the 400 range typically indicate that there was an issue with the request that was sent. Among other things, this could mean that you did not authenticate correctly, that you are requesting an action that you do not have authorization for, that the object you are requesting does not exist, or that your request is malformed.

If you receive a status in the 500 range, this generally indicates a server-side problem. This means that we are having an issue on our end and cannot fulfill your request currently.

400 and 500 level error responses will include a JSON object in their body, including the following attributes:

Name	Type	Description
id	string	A short identifier corresponding to the HTTP status code returned. For example, the ID for a response returning a 404 status code would be "not_found".
message	string	A message providing additional information about the error, including details to help resolve it when possible.
request_id	string	Optionally, some endpoints may include a request ID that should be provided when reporting bugs or opening support tickets to help identify the issue.

Example Error Response

    HTTP/1.1 403 Forbidden
    {
      "id":       "forbidden",
      "message":  "You do not have access for the attempted action."
    }
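
One possible way to handle such error bodies on the client side is sketched below. It reads the three documented attributes (`id`, `message`, and the optional `request_id`) into a small record. The `ApiError` class, the `parse_error` helper, and the `"unknown"` fallback for bodies that are not the documented JSON object are illustrative choices, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiError:
    status: int
    id: str
    message: str
    request_id: Optional[str] = None  # only some endpoints include this


def parse_error(status, body):
    """Turn a 4xx/5xx response body into an ApiError.

    Falls back to generic values (an assumption of this sketch) when the
    body is missing or is not a JSON object.
    """
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return ApiError(
        status=status,
        id=data.get("id", "unknown"),
        message=data.get("message", ""),
        request_id=data.get("request_id"),
    )
```

Keeping `request_id` on the record makes it straightforward to include in a bug report or support ticket when it is present.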

Responses

When a request is successful, a response body will typically be sent back in one of two formats depending on the request parameters:

JSON Response (Non-Streaming)

By default, or when stream is set to false, the API returns a complete JSON response once the inference is finished. The response contains the full generated content in a single payload.

    {
        "id": "chatcmpl-abc123",
        "object": "chat.completion",
        "created": 1677858242,
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "Hello! How can I help you today?"
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": 10,
            "completion_tokens": 12,
            "total_tokens": 22
        }
    }
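
A minimal sketch of reading this payload follows. It assumes the body has the shape shown above, with at least one entry in `choices`; the `extract_reply` helper name is hypothetical.

```python
import json


def extract_reply(body):
    """Return (text, total_tokens) from a non-streaming response body.

    total_tokens is None when the body carries no "usage" object.
    """
    data = json.loads(body)
    text = data["choices"][0]["message"]["content"]
    total_tokens = data.get("usage", {}).get("total_tokens")
    return text, total_tokens
```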

Streaming Response (Server-Sent Events)

When stream is set to true, the API returns the response as Server-Sent Events (SSE) with content type text/event-stream. This allows you to receive tokens as they are generated, providing a more responsive user experience.

Each event in the stream is prefixed with data: and contains a JSON object representing a chunk of the response. The stream ends with a data: [DONE] message.

    data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}

    data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

    data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

    data: [DONE]

To consume streaming responses, read the response body line by line, parse the payload of each data: line as JSON (stopping at the final data: [DONE] line, which is not JSON), and append each delta.content field to build up the complete response incrementally.
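
The consumption loop described here can be sketched as follows. The function takes any iterable of already-decoded text lines, so it makes no assumption about which HTTP client produced them; the `collect_stream` name is hypothetical.

```python
import json


def collect_stream(lines):
    """Assemble the full text from an iterable of SSE lines (str)."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel; not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

In an interactive application you would likely display each piece as it arrives instead of joining at the end; collecting into a list simply keeps the sketch easy to check.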

Rate Limit

Requests through the API are rate limited per OAuth token. Current rate limits:

5,000 requests per hour
250 requests per minute (5% of the hourly total)

Once you exceed either limit, you will be rate limited until the next cycle starts. Space out any requests that you would otherwise issue in bursts for the best results.

The rate limiting information is contained within the response headers of each request. The relevant headers are:

ratelimit-limit: The number of requests that can be made per hour.
ratelimit-remaining: The number of requests that remain before you hit your request limit. See the information below for how the request limits expire.
ratelimit-reset: This represents the time when the oldest request will expire. The value is given in Unix epoch time. See below for more information about how request limits expire.

One additional header is returned only in the error response sent when the burst limit is reached:

retry-after: The number of seconds to wait before making another request when rate limited.

As long as the ratelimit-remaining count is above zero, you will be able to make additional requests.

The way that a request expires and is removed from the current limit count is important to understand. Rather than counting all of the requests for an hour and resetting the ratelimit-remaining value at the end of the hour, each request instead has its own timer.

This means that each request contributes toward the ratelimit-remaining count for one complete hour after the request is made. When that request’s timer runs out, it is no longer counted towards the request limit.

This has implications on the meaning of the ratelimit-reset header as well. Because the entire rate limit is not reset at one time, the value of this header is set to the time when the oldest request will expire.

Keep this in mind if you see your ratelimit-reset value change, but not move an entire hour into the future.
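
To make the per-request timer concrete, here is a toy model of the behavior described above. It is a sketch for building intuition, not the service's implementation; the `SlidingWindow` class and its method names are hypothetical, and times are plain numbers of seconds.

```python
from collections import deque


class SlidingWindow:
    """Toy model of a rate limit where each request has its own timer."""

    def __init__(self, limit=5000, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self.sent = deque()  # timestamps of requests still being counted

    def _expire(self, now):
        # Drop requests whose individual timers have run out.
        while self.sent and self.sent[0] + self.window <= now:
            self.sent.popleft()

    def record(self, now):
        """Record a request at `now`; return False if it would be rejected."""
        self._expire(now)
        if len(self.sent) >= self.limit:
            return False
        self.sent.append(now)
        return True

    def remaining(self, now):
        self._expire(now)
        return self.limit - len(self.sent)

    def reset_at(self, now):
        """Time at which the oldest counted request expires (None if idle)."""
        self._expire(now)
        return self.sent[0] + self.window if self.sent else None
```

With a limit of 3 and requests at t=0, t=100, and t=200, the reset time is first 3600. Once the t=0 request expires, the reset time becomes 3700, a shift of only 100 seconds instead of a full hour.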

If ratelimit-remaining reaches zero, subsequent requests will receive a 429 error code until the time given in ratelimit-reset has been reached.

A ratelimit-remaining value of zero can also indicate that the "burst limit" of 250 requests per minute was met, even if the limit of 5,000 requests per hour was not. In this case, the 429 error response will include a retry-after header to indicate how long to wait (in seconds) until the request may be retried.

You can see the format of the response in the examples.

Sample Rate Limit Headers

    ...
    ratelimit-limit: 5000
    ratelimit-remaining: 4993
    ratelimit-reset: 1402425459
    ...

Sample Rate Limit Headers When Burst Limit Is Reached

    ...
    ratelimit-limit: 5000
    ratelimit-remaining: 0
    ratelimit-reset: 1402425459
    retry-after: 29
    ...

Sample Rate Exceeded Response

    HTTP/1.1 429 Too Many Requests
    {
      "id":       "too_many_requests",
      "message":  "API Rate limit exceeded."
    }
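
A client can use these headers to decide how long to pause after a 429. The sketch below prefers retry-after when it is present and otherwise waits until ratelimit-reset. It assumes header names have already been lower-cased into a plain mapping of strings; the `seconds_to_wait` helper and its one-second fallback are illustrative choices, not documented behavior.

```python
import time


def seconds_to_wait(status, headers, now=None):
    """Return how long to pause before the next request, in seconds.

    `headers` maps lower-cased header names to string values.
    """
    if now is None:
        now = time.time()
    if status != 429:
        return 0.0  # not rate limited
    if "retry-after" in headers:
        # Burst limit: the server states the wait directly, in seconds.
        return float(headers["retry-after"])
    if "ratelimit-reset" in headers:
        # Hourly limit: wait until the oldest request expires (Unix epoch time).
        return max(0.0, float(headers["ratelimit-reset"]) - now)
    return 1.0  # arbitrary fallback; an assumption of this sketch
```

Accepting `now` as a parameter keeps the function deterministic, which makes it easy to check against the sample header values.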