How to Use Multimodal Inference
Validated on 27 Apr 2026 • Last edited on 28 Apr 2026
Inference provides a single control plane for managing inference workflows. It includes a Model Catalog where you can view available foundation models, including both DigitalOcean-hosted and third-party commercial models, and compare model capabilities and pricing. You can also use routing to match inference requests to the best-fit model and run inference using serverless or dedicated deployments.
Multimodal models process and generate content across multiple data types, including images, audio, video, and text. With these models, you can use Vision-Language Models (VLMs), generate images, convert text to speech, and generate videos from text prompts.
Use Vision-Language Models (VLMs)
VLMs accept text and image (or video) inputs and return text outputs. Use them for visual question answering, image summarization, document understanding, and multimodal reasoning. Supported image formats are PNG, JPG, JPEG, and WEBP.
You can provide images as base64-encoded data URIs or HTTPS URLs. To stream responses, set stream: true; streamed output is delivered as server-sent events (SSE).
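The following Python sketch shows streaming in practice. It assumes the endpoint is OpenAI-compatible (as the curl examples in this guide suggest), that your model access key is in the MODEL_ACCESS_KEY environment variable, and uses a placeholder image URL:

import os
from openai import OpenAI

# Assumption: the API is OpenAI-compatible, so the standard OpenAI SDK can
# point at the Gradient endpoint used in the curl examples below.
client = OpenAI(
    base_url="https://api.gradient.ai/v1",
    api_key=os.environ["MODEL_ACCESS_KEY"],
)

# Stream a vision-language response; tokens arrive as server-sent events.
stream = client.chat.completions.create(
    model="nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
            ],
        }
    ],
    stream=True,
)

# Print each incremental piece of the answer as it arrives.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)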
For Kimi K2.5 models:
- Supports context windows of up to 256K tokens and up to 1,500 tool calls.
- Control reasoning traces using the system prompt, as shown in the example after this list:
  - /think: Enables extended reasoning; the response includes a reasoning_content field. We recommend setting temperature to 1.0.
  - /no_think: Disables reasoning for faster, direct answers. We recommend setting temperature to 0.6.
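For example, the following is a minimal sketch of enabling extended reasoning through the system prompt. The model ID kimi-k2.5 is a placeholder; check the Model Catalog for the exact name:

import os
from openai import OpenAI

# Same OpenAI-compatible client setup as the streaming sketch above.
client = OpenAI(
    base_url="https://api.gradient.ai/v1",
    api_key=os.environ["MODEL_ACCESS_KEY"],
)

# /think in the system prompt enables extended reasoning; the recommended
# temperature is 1.0. The model ID below is a placeholder.
response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Outline a three-step plan to debug a memory leak."},
    ],
    temperature=1.0,
)

message = response.choices[0].message
# With /think, the reasoning trace is returned in a reasoning_content field.
print(getattr(message, "reasoning_content", None))
print(message.content)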
The following example request sends a prompt to a Nemotron Nano 12B v2 VL model asking a question about a provided image. In this example, the image is provided as a base64-encoded data URI.
curl https://api.gradient.ai/v1/chat/completions \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron-nano-12b-v2-vl",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "What is shown in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."}}
]
}
],
"max_tokens": 512,
"temperature": 0.7
}'

The following Python example performs the same image Q&A:
import base64

# Assumes `client` is the OpenAI-compatible client configured in the
# streaming sketch above. Encode a local image as a base64 data URI and
# ask the model to describe it.
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nemotron-nano-12b-v2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }
    ]
)
print(response.choices[0].message.content)

Generate Images Using the Image Generation Endpoint
Generate or edit images from text prompts using the Stable Diffusion 3.5 Large model. Supported output resolutions are up to 1 megapixel (for example, 1024×1024).
curl https://api.gradient.ai/v1/images/generations \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "stable-diffusion-3.5-large",
"prompt": "A sunset over mountains",
"n": 1,
"size": "1024x1024",
"quality": "auto",
"response_format": "b64_json",
"background": "auto",
"output_format": "png"
}'

The response looks similar to the following:
{
"created": 1710700000,
"data": [
{
"b64_json": "<base64-encoded image>"
}
],
"usage": {
"total_tokens": 1
}
}

import base64
response = client.images.generate(
model="stable-diffusion-3.5-large",
prompt="A sunset over mountains",
n=1,
size="1024x1024",
quality="auto",
response_format="b64_json"
)
image_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.png", "wb") as f:
    f.write(image_bytes)

Generate Text-to-Speech Using the Text-to-Speech Endpoint
Convert text to natural-sounding speech. Audio data is returned as base64-encoded JSON (b64_json) in WAV or MP3 format. The response is always wrapped in the standard data envelope regardless of the requested format.
TTS Using Qwen 3 TTS Model
The following example sends a prompt to the Qwen 3 TTS model with the input string to speak, the alloy voice, and the mp3 response format. The Python example decodes the returned base64 audio and saves it as output.mp3.
curl https://api.gradient.ai/v1/audio/speech \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-tts-voicedesign",
"input": "Hello, world!",
"voice": "alloy",
"response_format": "mp3",
"speed": 1.0
}'

import base64
response = client.audio.speech.create(
model="qwen3-tts-voicedesign",
input="Hello world!",
voice="alloy",
response_format="mp3",
speed=1.0
)
audio_bytes = base64.b64decode(response.data[0].b64_json)
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

Generate Video From Text Using the Video Generation Endpoint
Generate short video clips from text prompts. Video generation is an asynchronous operation. You submit a job and poll for the result. The output video is in MP4 format, returned as binary content via the content endpoint, or as a presigned URL in the completed status response. Video generation can take anywhere from 30 seconds to several minutes.
The following example sends a request to generate a video using the Wan 2.2 T2V A14B model. First, submit the job:
curl -X POST https://api.gradient.ai/v1/videos \
-H "Authorization: Bearer $MODEL_ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "wan2.2-i2v-a14b",
"prompt": "Two anthropomorphic cats in boxing gear fight intensely on a spotlighted stage",
"size": "1280x720",
"fps": 16
}'

The request returns a job ID and job status:
{
"id": "job_abc123",
"status": "processing"
}

Next, poll for the result using the job ID:
curl https://api.gradient.ai/v1/videos/job_abc123 \
-H "Authorization: Bearer $MODEL_ACCESS_KEY"

You see the following when the job completes:
{
"created_at": 1777003604,
"error": null,
"id": "video_bGl0ZWxsbTpjdXN0b21fbGxtX3Byb3ZpZGVyOm9wZW5haTttb2RlbF9pZDo7dmlkZW9faWQ6ZGExYjFmYzMtYTM3NC00MmM0LTgxMWMtOWE5NmE3ZWY0OTIz",
"model": "wan2-2-t2v-a14b",
"object": "video",
"output": null,
"status": "completed",
"x_request_id": null
}

For large files, you can also fetch the binary MP4 directly using GET /v1/videos/{id}/content, which returns Content-Type: video/mp4.
import time
import requests
job = client.video.generations.create(
model="wan2.2-i2v-a14b",
prompt="Two anthropomorphic cats in boxing gear fight intensely on a spotlighted stage",
size="1280x720"
)
# Poll until complete
while True:
    result = client.video.generations.retrieve(job.id)
    if result.status == "completed":
        # Download the finished MP4 from the presigned URL in the response.
        with open("output.mp4", "wb") as f:
            f.write(requests.get(result.output.url).content)
        break
    elif result.status == "failed":
        raise Exception("Video generation failed")
    time.sleep(5)
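Alternatively, the following is a minimal sketch of downloading the finished MP4 from the content endpoint with the requests library; the video ID below is a placeholder and should come from the "id" field of the completed job response:

import os
import requests

# Placeholder: use the "id" value from the completed job response.
video_id = "video_abc123"

url = f"https://api.gradient.ai/v1/videos/{video_id}/content"
headers = {"Authorization": f"Bearer {os.environ['MODEL_ACCESS_KEY']}"}

# The content endpoint returns Content-Type: video/mp4; stream it to disk.
with requests.get(url, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    with open("output.mp4", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)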