# How to Use Dedicated Inference on DigitalOcean Gradient™ AI Platform (public)

DigitalOcean Gradient™ AI Platform lets you build fully-managed AI agents with knowledge bases for retrieval-augmented generation, multi-agent routing, guardrails, and more, or use serverless inference to make direct requests to popular foundation models.

Dedicated Inference is available in [public preview](https://docs.digitalocean.com/platform/product-lifecycle/index.html.md#public-preview) and enabled for all users. You can [contact support](https://cloudsupport.digitalocean.com) for questions or assistance.

You can create a dedicated inference deployment using the DigitalOcean [API](#create-dedicated-inference-using-automation) and [Control Panel](#create-dedicated-inference-using-the-control-panel).

Dedicated Inference is available in `ATL1`, `NYC2` and `TOR1` only.

## When to Use Dedicated Inference Versus Serverless Inference

Dedicated Inference is a managed inference service that enables you to host and scale open-source and commercial LLMs on dedicated GPUs. It gives you more control over the environment so you can choose the GPU, tune performance, and optimize your models for throughput, latency, cost or concurrency. Dedicated inference is best suited for steady, high-throughput workloads.

Serverless inference lets you send API requests directly to foundation models. Choose serverless inference over dedicated inference when you need to get started quickly without managing any components behind an inference endpoint, don't have a custom model to host or optimize, or have unpredictable or spiky inference traffic.

Pricing for serverless inference is based on the number of tokens used, while pricing for dedicated inference is based on the GPU hours used.

If you want to use serverless inference, see [Use Serverless Inference](https://docs.digitalocean.com/products/gradient-ai-platform/how-to/use-serverless-inference/index.html.md).
## Create Dedicated Inference Using Automation

Creating a dedicated inference deployment using the API requires you to send a `POST` request to the [`/v2/dedicated-inferences`](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_create) endpoint with the following parameters in the request JSON body:

- `name`: An inference name.
- `region`: Datacenter region.
- `hugging_face_token`: Hugging Face token for the model to be deployed. Required only for [gated models](https://huggingface.co/docs/hub/en/models-gated).
- `model_slug`: Model slug for the model to be deployed. Model slugs are the same as the slugs on Hugging Face.
- `gpu_plan_slug`: Slug for the GPU plan to deploy the model on. Use the [`/v2/dedicated-inferences/sizes`](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_list_sizes) endpoint to view available GPU plans and their slugs.
- `node_count`: Number of GPU nodes.

## How to Create a Dedicated Inference Deployment Using the DigitalOcean API

1. [Create a personal access token](https://docs.digitalocean.com/reference/api/create-personal-access-token/index.html.md) and save it for use with the API.
2. Send a POST request to [`https://api.digitalocean.com/v2/dedicated-inferences`](https://docs.digitalocean.com/reference/api/digitalocean//index.html.md#operation/dedicatedInferences_create).
### cURL

Using cURL:

```shell
# The '"$HF_TOKEN"' sequence briefly closes the single-quoted JSON body
# so that the shell expands the variable instead of sending it literally.
curl -i -X POST "https://api.digitalocean.com/v2/dedicated-inferences" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{
    "spec": {
      "version": 1,
      "name": "new-dedicated-inference",
      "region": "atl1",
      "vpc": {
        "uuid": "7e5c619c-359c-44ca-87e2-47e98170c012"
      },
      "enable_public_endpoint": true,
      "model_deployments": [{
        "model_slug": "mistral/mistral-7b-instruct-v3",
        "model_provider": "hugging_face",
        "workload_config": {},
        "accelerators": [{
          "scale": 2,
          "type": "prefill_decode",
          "accelerator_slug": "gpu-mi300x1-192gb"
        }]
      }]
    },
    "access_tokens": {
      "hugging_face_token": "'"$HF_TOKEN"'"
    }
  }'
```

In the response body, the `status` field shows `provisioning` while the dedicated inference deployment is being provisioned and updates to `active` when provisioning is complete. If GPU capacity is not available, we attempt to provision the cluster for one to two hours, after which the `status` shows `error`.

To list all your dedicated inference deployments, send a `GET` request to the [`/v2/dedicated-inferences` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_list).

## How to List Dedicated Inference Deployments Using the DigitalOcean API

1. [Create a personal access token](https://docs.digitalocean.com/reference/api/create-personal-access-token/index.html.md) and save it for use with the API.
2. Send a GET request to [`https://api.digitalocean.com/v2/dedicated-inferences`](https://docs.digitalocean.com/reference/api/digitalocean//index.html.md#operation/dedicatedInferences_list).
### cURL

Using cURL:

```shell
curl -i -X GET "https://api.digitalocean.com/v2/dedicated-inferences?region=nyc2&page=1&per_page=20" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DO_TOKEN"
```

## Create Dedicated Inference Using the Control Panel

To create a dedicated inference deployment from the [DigitalOcean Control Panel](https://cloud.digitalocean.com), click **Agent Platform** in the left menu. Select the **Dedicated Inference** tab and click **Deploy Dedicated Inference** to open the **Create an Inference Deployment** page.

### Choose a Datacenter Region

Select the region for your dedicated inference deployment. All resources created in this datacenter are members of the same VPC network.

### Select a Pre-trained Model

You can use a pre-trained model hosted on Hugging Face, such as DeepSeek, OpenAI, or Llama models. Click **Details** next to each model to learn more about the model and its capabilities.

Some models are gated and require you to request access on Hugging Face, as described in [their documentation](https://huggingface.co/docs/hub/en/models-gated). If you select a gated model, make sure you have been granted access before creating your dedicated inference deployment. Then, provide a Hugging Face access token in the **HuggingFace Access Token** field. To learn how to create an access token, see the Hugging Face documentation on [access tokens](https://huggingface.co/docs/hub/en/security-tokens).

### GPU Plan

Select a GPU plan to deploy your dedicated inference. The GPU plan determines the hardware resources allocated to your deployment. You can choose from 1- or 8-GPU AMD or NVIDIA GPU plans. Next, specify the number of GPU nodes.

### Finalize

Specify a name for your dedicated inference deployment. Names must be lowercase and can only contain letters, numbers, and hyphens.

In the **Summary** section, review the cost based on the selections you made. Then, click **Deploy Dedicated Inference** to create your deployment.
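
The naming rule in the **Finalize** step can be checked locally before you submit a name. The following is a minimal sketch, not part of the platform: the function name `is_valid_deployment_name` is hypothetical, and it assumes the only constraints are the ones stated above (lowercase letters, numbers, and hyphens) plus a non-empty name. The service may enforce additional rules, such as a maximum length.

```shell
#!/bin/sh
# Use the C locale so that the a-z range matches ASCII lowercase letters only.
LC_ALL=C
export LC_ALL

# Returns 0 if the name is non-empty and contains only lowercase letters,
# digits, and hyphens; returns 1 otherwise.
# Assumption: these are the only rules; the service may enforce more.
is_valid_deployment_name() {
  case "$1" in
    '' | *[!a-z0-9-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example usage with a hypothetical name.
name="new-dedicated-inference"
if is_valid_deployment_name "$name"; then
  echo "valid: $name"
else
  echo "invalid: $name"
fi
```

A check like this only catches obvious mistakes early; the Control Panel and API remain the source of truth for what names are accepted.
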
It may take several minutes for the deployment to be provisioned.

The page lists all the dedicated inference deployments in your team. Here, you can view details such as the name, status, public endpoint, and creation time of each deployment. The **Status** column shows **Provisioning** while the deployment is being provisioned and updates to **Active** when provisioning is complete. If GPU capacity is not available, we try to provision the cluster for one to two hours, after which the status shows **Error**.

![The dedicated inference page showing the deployments along with their status and creation time.](https://docs.digitalocean.com/screenshots/gradient-ai-platform/dedicated-inference-page.b3148898c6fbcbc0ada02cb92cc9f29b526d237fb8a514e7289c36c18bb693a9.png)

Once the deployment is active, you can view its details and use the endpoint to send inference requests. Make sure to note the access token, which is visible only once immediately after creation. Copy and store it securely, as you need it to interact with the deployment endpoint.

## Use the Dedicated Inference Endpoint

Dedicated inference deployments have two types of endpoints:

- **Private Endpoint**: Use when you want resources within the deployment's VPC network to access the endpoint.
- **Public Endpoint**: Use when you want to allow external sources to access the endpoint.

Using dedicated inference endpoints requires an access token to authenticate your requests. We automatically create an access token when provisioning your dedicated inference deployment. You can use the same token to authenticate requests to both the public and private endpoints. If you create the deployment using the Control Panel, the token is visible only once immediately after creation, so make sure to copy and store it securely.
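
Because the token is shown only once, one option is to write it to a file that only your user can read and load it into an environment variable when needed. This is a minimal sketch under stated assumptions: the helper names `save_token` and `load_token` and the file path are hypothetical, and `AUTH_TOKEN` is simply the variable name used in the endpoint requests in this guide. A dedicated secrets manager may be a better fit for production use.

```shell
#!/bin/sh
# Writes the token ($1) to a file ($2) readable only by the current user.
# umask restricts permissions on a newly created file; chmod covers the case
# where the file already existed with looser permissions.
save_token() {
  ( umask 077 && printf '%s\n' "$1" > "$2" ) && chmod 600 "$2"
}

# Reads the token from a file ($1) into AUTH_TOKEN for later requests.
load_token() {
  AUTH_TOKEN=$(cat "$1") && export AUTH_TOKEN
}

# Example usage with a placeholder value and a hypothetical path:
# save_token "paste-your-token-here" "$HOME/.do-inference-token"
# load_token "$HOME/.do-inference-token"
```

Note that typing the token directly on the command line can leave it in your shell history, so consider pasting it from a prompt or a password manager instead.
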
You can also send a `GET` request to the [`/v2/dedicated-inferences/{dedicated_inference_id}/tokens` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_list_tokens) to list the current tokens.

To create additional access tokens in the Control Panel, go to the **Endpoint Access Keys** section of your deployment's **Settings** tab and click **Create Access Key** to open the **Create Endpoint Access Key** window. Provide a name for your access key and click **Create Access Key**. Copy the access token to use it later in your HTTP requests. You can also create access tokens using the [`/v2/dedicated-inferences/{dedicated_inference_id}/tokens` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_create_tokens).

## Using the Control Panel

Click the inference deployment to go to its **Overview** page. Here, you can view the private and public endpoints for the deployment, and use cURL to get details for the dedicated inference and interact with chat endpoints.

You can view the private endpoint in the **Private Endpoint (VPC Network)** field. To use the private endpoint, click **Download CA Certificate** to download the CA certificate and note the path where you saved it. Then, copy the cURL request from **Configuration Example**, update the certificate path, and provide the access token value in the request to interact with the private endpoint. For example:

```curl
# Set CERT_FILE_PATH to the downloaded CA certificate and
# AUTH_TOKEN to your endpoint access token before running.
curl --cacert "$CERT_FILE_PATH" \
  --location 'https://yeb78cdp0s5ux27pam0g1gg3-private-dedicated-inference.do-infra.ai/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 150
  }'
```

You can view the public endpoint in the **Public Endpoint** field.
To use the public endpoint, copy the cURL request in **Configuration Example**, and provide the access token value in the request to interact with the public endpoint. For example:

```curl
# Set AUTH_TOKEN to your endpoint access token before running.
curl --location 'https://yeb78cdp0s5ux27pam0g1gg3-public-dedicated-inference.do-infra.ai/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 150
  }'
```

## Using the API

Use the following request to interact with the public endpoint of your dedicated inference. You need the token value, model slug, and public endpoint from the response body [when you create the dedicated inference](#create-dedicated-inference-using-automation).

```curl
# Variables are double-quoted (or spliced in with '"$VAR"') so the shell expands them.
curl --location "$DEDICATED_INFERENCE_PUBLIC_ENDPOINT/v1/chat/completions" \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $AUTH_TOKEN" \
  --data '{
    "model": "'"$MODEL_SLUG"'",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept of recursion in programming."
      }
    ],
    "max_tokens": 150
  }'
```

Use the following request to connect privately to your dedicated inference using the private endpoint. You need to use the CA certificate in your call, which you can get by sending a `GET` request to the [`/v2/dedicated-inferences/{dedicated_inference_id}/ca` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_get_ca). You also need the token value, model slug, and private endpoint from the response body [when you create the dedicated inference](#create-dedicated-inference-using-automation).
```curl
# Variables are double-quoted (or spliced in with '"$VAR"') so the shell expands them.
curl --cacert "$CERT_FILE_PATH" \
  --location "$DEDICATED_INFERENCE_PRIVATE_ENDPOINT/v1/chat/completions" \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --header "Authorization: Bearer $AUTH_TOKEN" \
  --data '{
    "model": "'"$MODEL_SLUG"'",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept of recursion in programming."
      }
    ],
    "max_tokens": 150
  }'
```

## View or Update Dedicated Inference Settings

## Using the Control Panel

To view the inference settings, click the **…** menu to the right of your deployment and select **View settings**. On the **Settings** page, you can view deployment information such as the GPU plan and node count.

You can also update the dedicated inference resource settings in the **Settings** tab.

## Using the API

Viewing or updating a dedicated inference deployment using the DigitalOcean API requires the unique identifier of the deployment. You can get a list of deployments with their unique identifiers using the [`/v2/dedicated-inferences` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_list).

## How to View Settings of a Dedicated Inference Deployment Using the DigitalOcean API

1. [Create a personal access token](https://docs.digitalocean.com/reference/api/create-personal-access-token/index.html.md) and save it for use with the API.
2. Send a GET request to [`https://api.digitalocean.com/v2/dedicated-inferences/{dedicated_inference_id}`](https://docs.digitalocean.com/reference/api/digitalocean//index.html.md#operation/dedicatedInferences_get).

### cURL

Using cURL:

```shell
curl -i -X GET "https://api.digitalocean.com/v2/dedicated-inferences/6b5c619c-359c-44ca-87e2-47e98170c01d" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DO_TOKEN"
```

## How to Update Settings of a Dedicated Inference Deployment Using the DigitalOcean API

1. [Create a personal access token](https://docs.digitalocean.com/reference/api/create-personal-access-token/index.html.md) and save it for use with the API.
2. Send a PATCH request to [`https://api.digitalocean.com/v2/dedicated-inferences/{dedicated_inference_id}`](https://docs.digitalocean.com/reference/api/digitalocean//index.html.md#operation/dedicatedInferences_patch).

### cURL

Using cURL:

```shell
# The '"$HF_TOKEN"' sequence briefly closes the single-quoted JSON body
# so that the shell expands the variable instead of sending it literally.
curl -i -X PATCH "https://api.digitalocean.com/v2/dedicated-inferences/6b5c619c-359c-44ca-87e2-47e98170c01d" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{
    "spec": {
      "name": "renamed-dedicated-inference",
      "region": "atl1",
      "vpc": {
        "uuid": "997615ce-132d-4bae-9270-9ee21b395e5d"
      },
      "model_deployments": [{
        "model_slug": "mistral/mistral-7b-instruct-v3",
        "accelerator_slug": "gpu-mi300x1-192gb",
        "node_count": 3
      }]
    },
    "access_tokens": {
      "hugging_face_token": "'"$HF_TOKEN"'"
    }
  }'
```

## Destroy Dedicated Inference Deployment

## Using the Control Panel

To destroy a dedicated inference deployment, click the **…** menu for the deployment that you want to destroy and select **Destroy**. In the **Delete Dedicated Inference** window, type the name of the deployment, and then click **Delete**.

## Using the API

Destroying a dedicated inference deployment using the DigitalOcean API requires the unique identifier of the deployment you want to destroy. You can get a list of deployments with their unique identifiers using the [`/v2/dedicated-inferences` endpoint](https://docs.digitalocean.com/reference/api/digitalocean/index.html.md#tag/Dedicated-Inference/operation/dedicatedInferences_list).

## How to Destroy a Dedicated Inference Deployment Using the DigitalOcean API

1. [Create a personal access token](https://docs.digitalocean.com/reference/api/create-personal-access-token/index.html.md) and save it for use with the API.
2. Send a DELETE request to [`https://api.digitalocean.com/v2/dedicated-inferences/{dedicated_inference_id}`](https://docs.digitalocean.com/reference/api/digitalocean//index.html.md#operation/dedicatedInferences_delete).

### cURL

Using cURL:

```shell
curl -i -X DELETE "https://api.digitalocean.com/v2/dedicated-inferences/6b5c619c-359c-44ca-87e2-47e98170c01d" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DO_TOKEN"
```
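
In a script, it can be more convenient to capture only the HTTP status code of the DELETE request than to read the full response printed by `-i`. The following is a minimal sketch under stated assumptions: the function name `destroy_deployment` is hypothetical, and treating any 2xx status as success is an assumption, because this guide does not state which status code the endpoint returns. It relies on curl's standard `-s`, `-o`, and `-w '%{http_code}'` options.

```shell
#!/bin/sh
# Deletes the deployment with the given ID ($1) and reports the HTTP status.
# Assumes DO_TOKEN holds a personal access token, as in the requests above.
# Assumption: any 2xx status means the request was accepted.
destroy_deployment() {
  http_status=$(curl -s -o /dev/null -w '%{http_code}' -X DELETE \
    -H "Authorization: Bearer $DO_TOKEN" \
    "https://api.digitalocean.com/v2/dedicated-inferences/$1")
  case "$http_status" in
    2??)
      echo "Destroyed $1 (HTTP $http_status)"
      ;;
    *)
      echo "Could not destroy $1 (HTTP $http_status)" >&2
      return 1
      ;;
  esac
}

# Example usage with the placeholder ID from the request above:
# destroy_deployment "6b5c619c-359c-44ca-87e2-47e98170c01d"
```

Returning a non-zero exit code on failure lets a calling script stop or retry instead of continuing as if the deployment were gone.
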