# DigitalOcean Gradient™ AI Inference Hub Features

DigitalOcean Gradient™ AI Inference Hub provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare capabilities and pricing, and run inference using serverless or dedicated deployments.

DigitalOcean Gradient AI Inference Hub is in [public preview](https://docs.digitalocean.com/platform/product-lifecycle/index.html.md#public-preview) and enabled for all users. You can [contact support](https://cloudsupport.digitalocean.com) for questions or assistance.

## Models

- **Model Catalog:** Browse available foundation models in the Model Catalog. You can view model capabilities and pricing; filter models by provider, classification, context window, and other attributes; open models in the Model Playground for testing; run inference using serverless deployments; or deploy models on dedicated infrastructure for production workloads. For more information about using models and serverless inference, see the [Models](https://docs.digitalocean.com/products/gradient-ai-platform/details/models/index.html.md) section.

- **Model Playground:** Test and compare model performance in a web-based interface. You can adjust settings like temperature and token limits, evaluate model responses, and fine-tune how your agents behave.

## Serverless Inference

Send API requests directly to foundation models without creating or managing an agent. Serverless inference runs requests immediately using your model access key and model ID, with no need to define instructions or context ahead of time.
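As a rough illustration, a serverless inference request is an OpenAI-style chat completion call authenticated with your model access key. The sketch below is a minimal example using only the Python standard library; the endpoint URL, model ID, and key shown are placeholders (check the serverless inference documentation for the current endpoint and your available model IDs).

```python
import json
import urllib.request

# Assumed OpenAI-compatible endpoint; verify against the serverless
# inference docs before use. The key below is a placeholder.
INFERENCE_URL = "https://inference.do-ai.run/v1/chat/completions"
MODEL_ACCESS_KEY = "your-model-access-key"


def build_chat_request(model_id, user_prompt, max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
    }


def run_inference(model_id, user_prompt):
    """Send one serverless inference request and return the model's reply."""
    payload = build_chat_request(model_id, user_prompt)
    req = urllib.request.Request(
        INFERENCE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {MODEL_ACCESS_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because there is no agent, each request is self-contained: the payload carries the model ID and any instructions or context you want the model to see.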
### Prompt Caching

Use prompt caching with the serverless inference [chat completion](https://docs.digitalocean.com/products/inference-hub/how-to/use-serverless-inference-deployments/index.html.md#chat-completions-api) and [responses](https://docs.digitalocean.com/products/inference-hub/how-to/use-serverless-inference-deployments/index.html.md#responses-api) APIs to cache context and reuse it in future requests. If part of your request is already cached, you are charged a lower price for those cached tokens and the [standard price](https://docs.digitalocean.com/products/inference-hub/details/pricing/index.html.md) for the remaining input tokens, which significantly reduces the cost of inference. For the list of models that support prompt caching, see [Foundation Models](https://docs.digitalocean.com/products/gradient-ai-platform/details/models/index.html.md#foundation-models). For examples, see [Use Prompt Caching](https://docs.digitalocean.com/products/inference-hub/how-to/use-serverless-inference-deployments/index.html.md#use-prompt-caching) in the serverless inference documentation.

## Dedicated Inference (public)

Dedicated Inference is a managed inference service that lets you host and scale open-source and commercial LLMs on dedicated GPUs and deploy your model as an inference endpoint. You choose the GPU type per endpoint and scale endpoints up or down by setting the desired node count, including scaling GPU replicas down to zero to avoid paying for idle GPU time.

Dedicated Inference is a Kubernetes-native product: we manage ingress networking, RDMA for multi-node model serving, model storage, the model serving engine (vLLM), single- and multi-node disaggregation, and the software components for autoscaling, prefix-aware routing, and parallelism.

Dedicated Inference is available in [public preview](https://docs.digitalocean.com/platform/product-lifecycle/index.html.md#public-preview) and enabled for all users.
You can [contact support](https://cloudsupport.digitalocean.com) for questions or assistance.
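To make the prompt-caching pricing described above concrete, the sketch below estimates input-token cost when part of a prompt hits the cache. The per-million-token prices in the usage example are illustrative placeholders, not DigitalOcean's actual rates; see the pricing page for real numbers.

```python
def input_token_cost(input_tokens, cached_tokens, standard_price, cached_price):
    """Estimate the input-token cost of one request with prompt caching.

    Cached tokens are billed at the lower cached price; the rest are
    billed at the standard price. Prices are per one million tokens.
    """
    uncached_tokens = input_tokens - cached_tokens
    return (uncached_tokens * standard_price
            + cached_tokens * cached_price) / 1_000_000


# Illustrative prices (not real rates): $0.60 standard, $0.15 cached,
# per million input tokens.
with_cache = input_token_cost(10_000, 8_000, 0.60, 0.15)
without_cache = input_token_cost(10_000, 0, 0.60, 0.15)
```

In this example, a 10,000-token prompt with an 8,000-token cached prefix costs $0.0024 in input tokens instead of $0.0060, so requests that share a long common prefix (such as a fixed system prompt) benefit the most.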