
How to Use Prompt Caching in Chat Completions and Responses API

Validated on 10 Apr 2026 • Last edited on 16 Apr 2026

DigitalOcean Gradient™ AI Inference Hub provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare capabilities and pricing, and run inference using serverless or dedicated deployments. DigitalOcean Gradient AI Inference Hub is in private preview. You can contact support for questions or assistance.


Use prompt caching with the chat completions and responses APIs to cache context and reuse it in future requests. If part of your request is already cached, you are charged a lower price for the cached tokens and the standard price for the remaining input tokens, which can reduce the cost of inference.
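As a rough illustration of the billing rule above, the following Python sketch splits input tokens into cached and uncached portions and prices each separately. The function name and the per-million-token prices are hypothetical placeholders, not actual rates, and the sketch ignores any separate pricing that might apply when a cache entry is first written.

```python
def estimate_input_cost(uncached_tokens: int, cached_tokens: int,
                        standard_price_per_mtok: float,
                        cached_price_per_mtok: float) -> float:
    """Estimate input cost: cached tokens at the lower price, the rest at the standard price.

    Prices are per one million tokens; the result is in the same currency unit.
    """
    if uncached_tokens < 0 or cached_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (uncached_tokens * standard_price_per_mtok
            + cached_tokens * cached_price_per_mtok) / 1_000_000


# Hypothetical prices for illustration only (not real rates).
STANDARD_PRICE = 3.00  # per million uncached input tokens
CACHED_PRICE = 0.30    # per million cached input tokens

# One million input tokens with nothing cached, versus 90% served from the cache.
without_cache = estimate_input_cost(1_000_000, 0, STANDARD_PRICE, CACHED_PRICE)
with_cache = estimate_input_cost(100_000, 900_000, STANDARD_PRICE, CACHED_PRICE)
print(f"without cache: {without_cache:.2f}, with cache: {with_cache:.2f}")
```

Check the Model Catalog for the real cached and standard prices of the model you use before relying on an estimate like this.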

Anthropic Models

Use prompt caching for Anthropic models in the chat completions API. Specify the cache_control parameter with type: ephemeral and ttl in your JSON request body. The ttl value can be 5m (default) or 1h. The following request body examples show how to use the cache_control parameter.

Single content part with cache control

...
{
  "role": "user",
  "content": {
    "type": "text",
    "text": "This is cached for 1h.",
    "cache_control": {
      "type": "ephemeral",
      "ttl": "1h"
    }
  }
}

Array of content parts (mixed cached and non-cached request)

...
{
  "role": "developer",
  "content": [
    {
      "type": "text",
      "text": "Cache this segment for 5 minutes.",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    },
    {
      "type": "text",
      "text": "Do not cache this segment"
    }
  ]
}

Tool message content with cache control

...
{
  "role": "tool",
  "tool_call_id": "tool_call_id",
  "content": [
    {
      "type": "text",
      "text": "Tool output cached for 5m.",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    }
  ]
}
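The request bodies above can also be assembled programmatically. This is a minimal Python sketch, assuming a hypothetical helper named text_part that attaches cache_control only when a ttl is passed and rejects values other than the two listed above. It only builds the message dictionary and does not send a request.

```python
import json

# TTL values listed in this guide for the cache_control parameter.
ALLOWED_TTLS = ("5m", "1h")


def text_part(text, ttl=None):
    """Build a text content part; add cache_control only when a ttl is given."""
    part = {"type": "text", "text": text}
    if ttl is not None:
        if ttl not in ALLOWED_TTLS:
            raise ValueError(f"ttl must be one of {ALLOWED_TTLS}, got {ttl!r}")
        part["cache_control"] = {"type": "ephemeral", "ttl": ttl}
    return part


# Mixed cached and non-cached parts, mirroring the array example above.
message = {
    "role": "developer",
    "content": [
        text_part("Cache this segment for 5 minutes.", ttl="5m"),
        text_part("Do not cache this segment"),
    ],
}

print(json.dumps(message, indent=2))
```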

The JSON response looks similar to the following and shows the number of input tokens cached during this request:

    "usage": {
        "cache_created_input_tokens": 1043,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 0,
            "ephemeral_5m_input_tokens": 1043
        },
        "cache_read_input_tokens": 0,
        "completion_tokens": 100,
        "prompt_tokens": 14,
        "total_tokens": 114
    }

If you send the same request again before the ttl expires, the cached input tokens are read instead of created, and the response looks like this:

    "usage": {
        "cache_created_input_tokens": 0,
        "cache_creation": {
            "ephemeral_1h_input_tokens": 0,
            "ephemeral_5m_input_tokens": 0
        },
        "cache_read_input_tokens": 1043,
        "completion_tokens": 100,
        "prompt_tokens": 14,
        "total_tokens": 114
    }
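To tell the two responses above apart in code, you can inspect the cache fields in the usage object. The following Python sketch assumes the field names shown in the sample responses; the function name and the labels it returns ("hit", "partial", "write", "none") are hypothetical conventions chosen for this example.

```python
def cache_status(usage: dict) -> str:
    """Classify a usage object by whether cache tokens were read, created, or both."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_created_input_tokens", 0)
    if read > 0 and created > 0:
        return "partial"  # some tokens read from cache, others newly cached
    if read > 0:
        return "hit"      # cached tokens were reused
    if created > 0:
        return "write"    # tokens were cached by this request
    return "none"         # no cache activity reported


# Usage objects copied from the two sample responses above.
first_usage = {
    "cache_created_input_tokens": 1043,
    "cache_creation": {"ephemeral_1h_input_tokens": 0, "ephemeral_5m_input_tokens": 1043},
    "cache_read_input_tokens": 0,
    "completion_tokens": 100,
    "prompt_tokens": 14,
    "total_tokens": 114,
}
second_usage = {
    "cache_created_input_tokens": 0,
    "cache_creation": {"ephemeral_1h_input_tokens": 0, "ephemeral_5m_input_tokens": 0},
    "cache_read_input_tokens": 1043,
    "completion_tokens": 100,
    "prompt_tokens": 14,
    "total_tokens": 114,
}

print(cache_status(first_usage), cache_status(second_usage))
```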

OpenAI Models

Use [prompt caching](/products/inference-hub/details/features/#prompt-caching) for OpenAI models with prompts containing 1024 tokens or more in both the chat completions and responses APIs. Caching applies when the input tokens of a request match tokens from a previous request, though this is best-effort and not guaranteed.

To use prompt caching, specify the prompt_cache_retention parameter as either in_memory or 24h. The following request body example shows how to use the prompt_cache_retention parameter:

...
{
  "model": "gpt-4o-mini",
  "prompt_cache_retention": "24h",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that summarizes text."
    },
    {
      "role": "user",
      "content": "Summarize the following text:\n\nArtificial intelligence is transforming industries by automating tasks, improving efficiency, and enabling new innovations..."
    }
  ],
  "temperature": 0.2
}
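A request body like the one above can be built and validated before sending. This Python sketch assumes a hypothetical helper named build_request_body that accepts only the two retention values listed above. It stops at producing the JSON body, because the endpoint and authentication details are outside the scope of this example.

```python
import json

# Retention values listed in this guide for prompt_cache_retention.
ALLOWED_RETENTION = ("in_memory", "24h")


def build_request_body(model, messages, retention="in_memory", **extra):
    """Build a chat completions body with prompt_cache_retention set."""
    if retention not in ALLOWED_RETENTION:
        raise ValueError(
            f"prompt_cache_retention must be one of {ALLOWED_RETENTION}, got {retention!r}"
        )
    body = {
        "model": model,
        "prompt_cache_retention": retention,
        "messages": messages,
    }
    body.update(extra)  # optional parameters such as temperature
    return body


body = build_request_body(
    "gpt-4o-mini",
    [
        {"role": "system", "content": "You are a helpful assistant that summarizes text."},
        {"role": "user", "content": "Summarize the following text:\n\n..."},
    ],
    retention="24h",
    temperature=0.2,
)

print(json.dumps(body, indent=2))
```

Keeping the static part of the prompt, such as the system message, identical across requests gives the best-effort cache matching described above more to work with.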

The JSON response looks similar to the following and shows the number of input tokens cached during this request:

{
  "id": "chatcmpl-xyz789",
  "object": "chat.completion",
  "created": 1772134300,
  "model": "gpt-4o-mini",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Artificial intelligence is reshaping industries by automating processes, increasing efficiency, and enabling innovation."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 1200,
    "completion_tokens": 35,
    "total_tokens": 1235,
    "cache_read_input_tokens": 0,
    "cache_created_input_tokens": 1200,
    "cache_creation": {
      "ephemeral_5m_input_tokens": 0,
      "ephemeral_1h_input_tokens": 1200
    }
  }
}

If you send the request again within the retention window, cached input tokens are used and the response looks like this:

"usage": {
  "prompt_tokens": 1200,
  "completion_tokens": 34,
  "total_tokens": 1234,
  "cache_read_input_tokens": 1200,
  "cache_created_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 0,
    "ephemeral_1h_input_tokens": 0
  }
}
