Inference APIs
Generated on 1 Jan 0001
The DigitalOcean Gradient™ AI Agentic Cloud provides inference capabilities through two API variants:
- Serverless Inference: Access serverless inference models using an inference key at https://inference.do-ai.run.
- Agent Inference: Integrate multi-agent workflows into your AI applications using a customer-specific agent endpoint.
These APIs are independent of the main DigitalOcean control-plane API (https://api.digitalocean.com).
Base URLs
| API | Base URL |
|---|---|
| Serverless Inference | https://inference.do-ai.run |
| Agent Inference | https://{your-agent-url}.agents.do-ai.run |
Requests
Any tool that is fluent in HTTP can communicate with the API simply by requesting the correct URI. Requests should be made using the HTTPS protocol so that traffic is encrypted. The interface responds to different methods depending on the action required.
| Method | Usage |
|---|---|
| GET | For simple retrieval of information about your account, Droplets, or environment, you should use the GET method. The information you request will be returned to you as a JSON object. The attributes defined by the JSON object can be used to form additional requests. Any request using the GET method is read-only and will not affect any of the objects you are querying. |
| DELETE | To destroy a resource and remove it from your account and environment, use the DELETE method. This removes the specified object if it is found. If it is not found, the operation returns a response indicating that the object was not found. This idempotency means that you do not have to check for a resource’s availability before issuing a delete command; the final state will be the same regardless of its existence. |
| PUT | To update the information about a resource in your account, the PUT method is available. Like the DELETE method, the PUT method is idempotent. It sets the state of the target using the provided values, regardless of their current values. Requests using the PUT method do not need to check the current attributes of the object. |
| PATCH | Some resources support partial modification. In these cases, the PATCH method is available. Unlike PUT, which generally requires a complete representation of a resource, a PATCH request is a set of instructions on how to modify a resource, updating only specific attributes. |
| POST | To create a new object, send a POST request to the target endpoint. The POST request includes all of the attributes necessary to create the new object. |
| HEAD | Finally, to retrieve metadata information, use the HEAD method to get the headers. This returns only the headers of what would be returned with an associated GET request. Response headers contain some useful information about your API access and the results that are available for your request. For instance, the headers contain your current rate-limit value and the amount of time available until the limit resets. They also contain metrics about the total number of objects found, pagination information, and the total content length. |
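As a rough illustration of the POST row above, the following sketch builds (but does not send) a request against the Serverless Inference base URL using only the Python standard library. The `/v1/chat/completions` path, the `Authorization: Bearer` header, and the `messages` payload shape are assumptions that are not stated in this document; only the base URL, the `stream` parameter, and the model name are taken from it. The helper name `build_chat_request` is hypothetical.

```python
import json
import urllib.request

# Base URL taken from the Base URLs table in this document.
SERVERLESS_BASE_URL = "https://inference.do-ai.run"


def build_chat_request(inference_key, model, prompt, stream=False):
    """Build (but do not send) a POST request for a chat completion.

    Assumptions, not confirmed by this document: the endpoint path,
    the Bearer authorization scheme, and the "messages" payload shape.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=SERVERLESS_BASE_URL + "/v1/chat/completions",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + inference_key,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "your-inference-key", "meta-llama/Llama-3.3-70B-Instruct", "Hello"
)
print(request.get_method(), request.full_url)
```

If those assumptions hold for your account, passing the returned object to `urllib.request.urlopen` would send it over HTTPS.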
HTTP Statuses
Along with the HTTP methods that the API responds to, it will also return standard HTTP statuses, including error codes.
In the event of a problem, the status will contain the error code, while the body of the response will usually contain additional information about the problem that was encountered.
In general, if the status returned is in the 200 range, it indicates that the request was fulfilled successfully and that no error was encountered.
Return codes in the 400 range typically indicate that there was an issue with the request that was sent. Among other things, this could mean that you did not authenticate correctly, that you are requesting an action that you do not have authorization for, that the object you are requesting does not exist, or that your request is malformed.
If you receive a status in the 500 range, this generally indicates a server-side problem. This means that we are having an issue on our end and cannot fulfill your request currently.
400 and 500 level error responses will include a JSON object in their body, including the following attributes:
| Name | Type | Description |
|---|---|---|
| id | string | A short identifier corresponding to the HTTP status code returned. For example, the ID for a response returning a 404 status code would be “not_found.” |
| message | string | A message providing additional information about the error, including details to help resolve it when possible. |
| request_id | string | Optionally, some endpoints may include a request ID that should be provided when reporting bugs or opening support tickets to help identify the issue. |
Example Error Response
HTTP/1.1 403 Forbidden
{
"id": "forbidden",
"message": "You do not have access for the attempted action."
}
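A client might turn such an error body into a small typed value so that calling code can branch on it. The sketch below assumes only the three attributes listed in the table above; the `ApiError` class and `parse_error` function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiError:
    status: int                       # HTTP status code of the response
    id: str                           # short identifier, e.g. "forbidden"
    message: str                      # human-readable explanation
    request_id: Optional[str] = None  # only present on some endpoints

    @property
    def is_server_error(self) -> bool:
        # 500-range statuses indicate a server-side problem.
        return 500 <= self.status < 600


def parse_error(status: int, body: str) -> ApiError:
    """Parse the JSON body of a 400- or 500-level response."""
    data = json.loads(body)
    return ApiError(
        status=status,
        id=data.get("id", ""),
        message=data.get("message", ""),
        request_id=data.get("request_id"),
    )


err = parse_error(
    403,
    '{"id": "forbidden", "message": "You do not have access for the attempted action."}',
)
print(err.id, err.is_server_error)  # prints: forbidden False
```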
Responses
When a request is successful, a response body will typically be sent back in one of two formats depending on the request parameters:
JSON Response (Non-Streaming)
By default, or when stream is set to false, the API returns a complete JSON
response once the inference is finished. The response contains the full generated
content in a single payload.
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1677858242,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 12,
"total_tokens": 22
}
}
Streaming Response (Server-Sent Events)
When stream is set to true, the API returns the response as Server-Sent Events
(SSE) with content type text/event-stream. This allows you to receive tokens as
they are generated, providing a more responsive user experience.
Each event in the stream is prefixed with data: and contains a JSON object
representing a chunk of the response. The stream ends with a data: [DONE] message.
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1677858242,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
To consume streaming responses, read the response body line by line, parse each
data: line as JSON, and process the delta.content field to build up the
complete response incrementally.
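The consumption steps described above can be sketched as follows. The function name `iter_deltas` is hypothetical, and the sample lines are shortened versions of the chunks shown above; with a live connection, each line of the response body would first need to be decoded from bytes to text.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield each delta.content fragment from a sequence of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]

text = "".join(iter_deltas(sample_lines))
print(text)  # prints: Hello!
```

Yielding fragments rather than returning the joined string lets a caller display tokens as they arrive, which is the main reason to request a stream.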
Rate Limit
Requests through the API are rate limited per OAuth token. Current rate limits:
- 5,000 requests per hour
- 250 requests per minute (5% of the hourly total)
Once you exceed either limit, you will be rate limited until the next cycle starts. Space out any requests that you would otherwise issue in bursts for the best results.
The rate limiting information is contained within the response headers of each request. The relevant headers are:
- ratelimit-limit: The number of requests that can be made per hour.
- ratelimit-remaining: The number of requests that remain before you hit your request limit. See the information below for how the request limits expire.
- ratelimit-reset: This represents the time when the oldest request will expire. The value is given in Unix epoch time. See below for more information about how request limits expire.
More rate limiting information is returned only within burst limit error response headers:
- retry-after: The number of seconds to wait before making another request when rate limited.
As long as the ratelimit-remaining count is above zero, you will be able
to make additional requests.
The way that a request expires and is removed from the current limit count
is important to understand. Rather than counting all of the requests for an
hour and resetting the ratelimit-remaining value at the end of the hour,
each request instead has its own timer.
This means that each request contributes toward the ratelimit-remaining
count for one complete hour after the request is made. When that request’s
timer runs out, it is no longer counted towards the request limit.
This has implications on the meaning of the ratelimit-reset header as
well. Because the entire rate limit is not reset at one time, the value of
this header is set to the time when the oldest request will expire.
Keep this in mind if you see your ratelimit-reset value change, but not
move an entire hour into the future.
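The per-request timer behavior can be modeled on the client side with a sliding window. This is only an illustrative model of the description above, not the service's actual implementation; the `SlidingWindow` class is a hypothetical name, and the exact expiry boundary (a request stops counting exactly one hour after it was made) is an assumption.

```python
from collections import deque

WINDOW_SECONDS = 3600  # each request counts for one hour
HOURLY_LIMIT = 5000    # hourly limit stated in this document


class SlidingWindow:
    """Client-side model of per-request expiry (illustrative only)."""

    def __init__(self, limit=HOURLY_LIMIT, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.timestamps = deque()  # request times, oldest first

    def _expire(self, now):
        # Drop every request whose own timer has run out.
        while self.timestamps and self.timestamps[0] + self.window <= now:
            self.timestamps.popleft()

    def record(self, now):
        self._expire(now)
        self.timestamps.append(now)

    def remaining(self, now):
        self._expire(now)
        return self.limit - len(self.timestamps)

    def reset_at(self, now):
        # Time when the oldest request still being counted will expire.
        self._expire(now)
        return self.timestamps[0] + self.window if self.timestamps else None


# Tiny limit so the effect is easy to see: requests at t=0, 600 and 1200 seconds.
tracker = SlidingWindow(limit=3)
for t in (0, 600, 1200):
    tracker.record(t)

print(tracker.remaining(1200), tracker.reset_at(1200))  # prints: 0 3600
print(tracker.remaining(3600), tracker.reset_at(3600))  # prints: 1 4200
```

In this model, once the first request expires at t=3600 the reset time advances only to 4200, the expiry of the next-oldest request, rather than a full hour ahead.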
If ratelimit-remaining reaches zero, subsequent requests will receive
a 429 error code until the time given in ratelimit-reset has been reached.
ratelimit-remaining reaching zero can also indicate that the “burst limit” of 250
requests per minute was met, even if the 5,000 requests per hour limit was not.
In this case, the 429 error response will include a retry-after header to indicate how
long to wait (in seconds) until the request may be retried.
You can see the format of the response in the examples.
Sample Rate Limit Headers
...
ratelimit-limit: 1200
ratelimit-remaining: 1193
ratelimit-reset: 1402425459
...
Sample Rate Limit Headers When Burst Limit is Reached:
...
ratelimit-limit: 5000
ratelimit-remaining: 0
ratelimit-reset: 1402425459
retry-after: 29
...
Sample Rate Exceeded Response
429 Too Many Requests
{
"id": "too_many_requests",
"message": "API Rate limit exceeded."
}
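A client could use these headers to decide how long to pause after a 429 response. The sketch below prefers retry-after when it is present and otherwise falls back to ratelimit-reset; the function name `seconds_to_wait` and the one-second default used when neither header is present are assumptions made for illustration.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(
    status: int, headers: Mapping[str, str], now: Optional[float] = None
) -> float:
    """Return how many seconds to pause before retrying a request."""
    if status != 429:
        return 0.0  # not rate limited
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "retry-after" in lowered:
        # Burst limit hit: the header gives the wait in seconds.
        return float(lowered["retry-after"])
    if "ratelimit-reset" in lowered:
        # Hourly limit hit: wait until the oldest request expires (Unix epoch time).
        return max(0.0, float(lowered["ratelimit-reset"]) - now)
    return 1.0  # assumed fallback when neither header is present


burst_headers = {
    "ratelimit-limit": "5000",
    "ratelimit-remaining": "0",
    "ratelimit-reset": "1402425459",
    "retry-after": "29",
}
print(seconds_to_wait(429, burst_headers))  # prints: 29.0
```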