Give Feedback

How to Use Multimodal Inference

Validated on 5 May 2026 • Last edited on 5 May 2026

Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, compare model capabilities and pricing, use routing to match inference requests to the best-fit model, and run inference using serverless or dedicated deployments.

Copy page as Markdown View page as Markdown

Multimodal models process and generate content across multiple data types, including images, audio, video, and text. You can use Vision-Language Models (VLMs), generate text-to-speech, or videos from text prompts using these models.

Use Vision-Language Models (VLMs)

VLMs accept text and image (or video) inputs and return text outputs. Use them for visual question answering, image summarization, document understanding, and multimodal reasoning. Supported image formats are PNG, JPG, JPEG, and WEBP.

You can provide images as base64-encoded data URIs or HTTPS URLs. For streaming, use stream: true (supported via server-sent events (SSE)).

For Kimi K2.5 models:

Supports up to 256K context windows and up to 1,500 tool calls.
Control reasoning traces using the system prompt:
- /think: Enables extended reasoning; response includes a reasoning_content field. We recommend setting temperature to 1.0.
- /no_think: Disables reasoning for faster, direct answers. We recommend setting temperature to 0.6.

For Kimi K2.6 (kimi-k2.6):

Input context window up to 262,144 tokens.
Max output up to 262,144 tokens.
Text and image inputs with text outputs.
Supports tool calling and streaming.
Returns extended reasoning in reasoning_content for every response. The /think and /no_think system-prompt controls supported by Kimi K2.5 don’t change this on Kimi K2.6.
To get a concise final answer in content, use a system prompt such as Respond with only the final answer. The model still returns reasoning_content, but content is shorter.
For agent deployments, default max output is 4,096 tokens (maximum 8,192, minimum 1).

cURL

The following example request sends a prompt to a Nemotron Nano 12B v2 VL model asking a question about a provided image. Images can be provided as base64-encoded data URIs or HTTPS URLs. In this example, the image URL is provided in the prompt.

curl https://inference.do-ai.run/v1/chat/completions \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-nano-12b-v2-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is shown in this image?"},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}}
        ]
      }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

PyDo

Image Q&A

import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }
    ]
)
print(response.choices[0].message.content)

Generate Images Using the Image Generation Endpoint

Generate or edit images from text prompts using Stable Diffusion 3.5 Large model. Supported output resolutions up to 1 megapixel (for example, 1024×1024).

cURL

curl https://inference.do-ai.run/v1/images/generations \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion-3.5-large",
    "prompt": "A sunset over mountains",
    "n": 1,
    "size": "1024x1024",
    "quality": "auto",
    "response_format": "b64_json",
    "background": "auto",
    "output_format": "png"
  }'

The response looks similar to the following:

{
  "created": 1710700000,
  "data": [
    {
      "b64_json": "<base64-encoded image>"
    }
  ],
  "usage": {
    "total_tokens": 1
  }
}

PyDo

import base64

response = client.images.generate(
    model="stable-diffusion-3.5-large",
    prompt="A sunset over mountains",
    n=1,
    size="1024x1024",
    quality="auto",
    response_format="b64_json"
)

image_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)

Generate Text-to-Speech Using the Text-to-Speech Endpoint

Convert text to natural-sounding speech. Audio data is returned as base64-encoded JSON (b64_json) in WAV or MP3 format. The response is always wrapped in the standard data envelope regardless of the requested format.

cURL

TTS Using Qwen 3 TTS Model

The following example sends a prompt to the Qwen 3 TTS model with the input string to speak, using voice of Alloy, and response format in .mp3.

curl -sS https://inference.do-ai.run/v1/audio/speech \
 -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
 -H "Content-Type: application/json" \
 -d '{
  "model": "qwen3-tts-voicedesign",
  "input": "Hello, world!",
  "voice": "alloy",
  "response_format": "mp3",
  "instructions": "Speak naturally."
 }'

PyDo

import base64

response = client.audio.speech.create(
    model="qwen3-tts-voicedesign",
    input="Hello world!",
    voice="alloy",
    response_format="mp3",
    instructions="Speak naturally."
)

audio_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

Generate Video From Text Using the Video Generation Endpoint

Generate short video clips from text prompts. Video generation is an asynchronous operation. You submit a job and poll for the result. The output video is in MP4 format, returned as binary content via the content endpoint, or as a presigned URL in the completed status response. Video generation can take anywhere from 30 seconds to several minutes.

cURL

The following example sends a request to generate a video using the Wan 2.2 T2V A14B model. First, submit the job:

curl -X POST https://inference.do-ai.run/v1/videos \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan2.2-t2v-a14b",
    "prompt": "Two anthropomorphic cats in boxing gear fight intensely on a spotlighted stage",
    "size": "1280x720",
    "fps": 16
  }'

The request returns a job ID and job status:

{
  "id": "job_abc123",
  "status": "processing"
}

Next, poll for the result using the job ID:

curl https://inference.do-ai.run/v1/video/generations/job_abc123 \
  -H "Authorization: Bearer $MODEL_ACCESS_KEY"

You see the following when the job completes:

{
  "created_at": 1777003604,
  "error": null,
  "id": "video_bGl0ZWxsbTpjdXN0b21fbGxtX3Byb3ZpZGVyOm9wZW5haTttb2RlbF9pZDo7dmlkZW9faWQ6ZGExYjFmYzMtYTM3NC00MmM0LTgxMWMtOWE5NmE3ZWY0OTIz",
  "model": "wan2.2-t2v-a14b",
  "object": "video",
  "output": null,
  "status": "completed",
  "x_request_id": null
}

For large files, you can also fetch the binary MP4 directly using GET /v1/videos/{id}/content, which returns Content-Type: video/mp4.

PyDo

import time
import requests

job = client.video.generations.create(
    model="wan2.2-t2v-a14b",
    prompt="Two anthropomorphic cats in boxing gear fight intensely on a spotlighted stage",
    size="1280x720"
)

# Poll until complete
while True:
    result = client.video.generations.retrieve(job.job_id)
    if result.status == "completed":
        with open("output.mp4", "wb") as f:
            f.write(requests.get(result.output.url).content)
        break
    elif result.status == "failed":
        raise Exception("Video generation failed")
    time.sleep(5)

How to Use Multimodal Inference

Use Vision-Language Models (VLMs)

Image Q&A

Generate Images Using the Image Generation Endpoint

Generate Text-to-Speech Using the Text-to-Speech Endpoint

TTS Using Qwen 3 TTS Model

Generate Video From Text Using the Video Generation Endpoint

We can't find any results for your search.