Inference Limits
Validated on 27 Apr 2026 • Last edited on 27 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view the available foundation models, both DigitalOcean-hosted and third-party commercial, and compare their capabilities and pricing. You can also use routing to match inference requests to the best-fit model and run inference using serverless or dedicated deployments.
Model Catalog Limits
- Model Catalog data currently cannot be retrieved through the DigitalOcean API.
- The MCP server endpoint uses the standard API rate limits. If you need higher limits for production workloads, contact support.
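Since MCP server requests count against the standard API rate limits, client code should expect occasional HTTP 429 responses. The following is a minimal retry sketch, assuming a generic JSON-over-HTTPS endpoint; the URL, headers, and function name are illustrative placeholders, not part of a documented DigitalOcean API.

```python
import time

import requests

def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 5) -> dict:
    """POST a JSON payload, backing off exponentially on HTTP 429 (rate limited)."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor the Retry-After header when present; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        time.sleep(float(retry_after) if retry_after else delay)
        delay *= 2
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")
```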
Model Playground Limits
- Only images are supported for file uploads.
Serverless Inference Limits
- Serverless inference supports the two to three most recent stable versions of each model to ensure consistent performance and reliable maintenance. For the list of supported models and versions, see the available model offerings.
- Serverless inference model endpoints support OpenAI-compatible request formats but may not be compatible with all OpenAI tools and plugins. For an example request, see the sketch after this list.
- Serverless inference provides access to commercial models, but not all model-specific features are supported. For example, Anthropic's extended thinking is not available.
- OpenAI models accessed through serverless inference do not support zero data retention. If your use case requires strict data privacy or compliance, consider using a different model or contact support for guidance.
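Because serverless endpoints accept OpenAI-compatible requests, you can point the official OpenAI Python SDK at them by overriding the base URL. This is a minimal sketch; the base URL and model slug are placeholders we assume for illustration, so substitute the endpoint and model identifier shown in your control panel and the model catalog.

```python
from openai import OpenAI

# Placeholder base URL and model slug; use the values from your control panel
# and the model catalog for your account.
client = OpenAI(
    base_url="https://inference.example.com/v1",
    api_key="YOUR_MODEL_ACCESS_KEY",
)

response = client.chat.completions.create(
    model="example-model-slug",
    messages=[{"role": "user", "content": "In one sentence, what is serverless inference?"}],
)
print(response.choices[0].message.content)
```

Tools that depend on OpenAI-specific features beyond the standard chat completions format may still fail, per the compatibility limit above.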
Dedicated Inference Limits
- The number of endpoints you can create when using dedicated inference depends on the limits set for your account. We use dynamic resource limits to protect our platform against bad actors. To request a limit increase, contact support. If you are a team owner or resource modifier, you can check your resource limits and request an increase on the Resource Limits page in the DigitalOcean Control Panel.
- Re-ranking, embedding, and audio/TTS models are not currently supported for deployment on a dedicated inference endpoint.
Batch Inference Limits
- Open-source and DigitalOcean-hosted models are not supported for batch inference.
- Only text prompts for OpenAI and Anthropic commercial models are supported. Multimodal requests and image generation batch jobs are not supported.
- Each batch job uses a single model. Multi-model batch jobs are not supported.
- Batch inference uses separate rate limits from serverless and dedicated inference (for a pre-submission check against the per-file limits, see the validation sketch after this list):

  | Limit | Default |
  | --- | --- |
  | Enqueue token limit | 10 billion tokens per model per account |
  | Requests per file | 50,000 |
  | Maximum file size | 200 MB |
  | Completion window | 24 hours |
  | Concurrent batch jobs | No hard limit (token-based quota applies) |

  To request a limit increase, contact support. If you are a team owner or resource modifier, you can check your resource limits and request an increase on the Resource Limits page in the DigitalOcean Control Panel. A running batch job does not consume your real-time API quota or degrade latency for your production applications.
- Batch traffic is isolated from real-time traffic. Batch jobs run at lower scheduling priority and share off-peak GPU capacity. Batch scheduling does not degrade real-time inference p99 latency by more than 5%.
- Additional limits apply based on your security tier.
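Before enqueueing a batch job, you can check an input file against the per-file limits in the table above. This sketch assumes a JSONL input file with one request per line; the file format and helper name are our assumptions, not part of the documented batch API.

```python
import os

MAX_REQUESTS_PER_FILE = 50_000            # requests-per-file limit
MAX_FILE_SIZE_BYTES = 200 * 1024 * 1024   # 200 MB maximum file size

def validate_batch_file(path: str) -> None:
    """Raise ValueError if a JSONL batch input file exceeds the per-file limits."""
    size = os.path.getsize(path)
    if size > MAX_FILE_SIZE_BYTES:
        raise ValueError(f"{path} is {size} bytes; the maximum file size is 200 MB")
    with open(path, "r", encoding="utf-8") as f:
        request_count = sum(1 for line in f if line.strip())
    if request_count > MAX_REQUESTS_PER_FILE:
        raise ValueError(f"{path} has {request_count} requests; the limit is 50,000 per file")
```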
Model Evaluation Limits
- Model evaluation datasets have the following limits (see the splitting sketch after this list):
  - Each dataset must have fewer than 1,000 rows.
  - Each dataset must be less than 1 GB in size, regardless of customer tier.
- Lower-tier customers do not have access to commercial models from Anthropic or OpenAI for evaluation or judging.
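An oversized evaluation dataset can be split into multiple files that each stay under the row limit. This is a minimal sketch assuming a CSV dataset with a header row; the accepted formats may differ, so treat the format and function names as assumptions.

```python
import csv

MAX_ROWS = 999  # each dataset must have fewer than 1,000 rows

def split_dataset(path: str, prefix: str) -> list[str]:
    """Split a CSV dataset into chunk files that each fit under the row limit."""
    out_paths: list[str] = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk: list[list[str]] = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == MAX_ROWS:
                out_paths.append(write_chunk(prefix, len(out_paths), header, chunk))
                chunk = []
        if chunk:
            out_paths.append(write_chunk(prefix, len(out_paths), header, chunk))
    return out_paths

def write_chunk(prefix: str, index: int, header: list[str], rows: list[list[str]]) -> str:
    """Write one chunk as its own CSV file and return its path."""
    out_path = f"{prefix}_{index}.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path
```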