1-Click Inference Ready - Single GPU
Generated on 22 Jul 2025 from the 1-Click Inference Ready - Single GPU catalog page
The Inference Optimized image simplifies the setup and deployment of large language models (LLMs) by leveraging Docker and vLLM, with built-in support for Hugging Face model downloads, speculative decoding, prompt caching, and multi-model concurrency.
Users can configure the system to run one, two, or four models concurrently, each with customizable tensor parallelism settings to optimize hardware utilization. If a model isn’t already cached locally, it is automatically downloaded and stored for future use.
The image also includes special handling for FP8 quantization, enabling efficient low-precision inference. Speculative decoding is fully supported, including the use of draft models to enhance performance.
This makes it a powerful, plug-and-play solution for scalable and optimized LLM inference.
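As an illustration only, one model slot launched by this image might look something like the following docker run. This is a minimal sketch assuming the bundled vllm/vllm-openai container, two GPUs per model, and the Hugging Face cache location used by this image; the actual run_model.sh automation adds further flags for FP8, speculative decoding, and logging:

# Hypothetical sketch of one 2-GPU model slot (not the script’s exact invocation)
docker run -d --name vllm \
  --gpus '"device=0,1"' \
  -p 8000:8000 \
  -v /.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.9.0 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2

The container exposes an OpenAI-compatible API on port 8000, which is the interface used throughout the rest of this guide.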
Models Available (H100x8)
Single Model Mode (8 GPUs)
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct
meta-llama/Llama-4-Scout-17B-16E-Instruct
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Any custom model available on Hugging Face (manual specification)
Two Models Concurrent Mode (4 GPUs Each)
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct
deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Four Models Concurrent Mode (2 GPUs Each)
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct-FP8
deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding
⚠️ Licenses: Please check the respective model licenses on Hugging Face.
Many models require license acceptance before running the automation script.
Models Available (Other Nvidia GPUs)
RTX 4000 ADA
meta-llama/Llama-3.1-8B-FP8
H100, L40S, or RTX6000 ADA
meta-llama/Llama-3.1-8B
meta-llama/Llama-3.1-8B-FP8
mistralai/Mistral-Nemo-Instruct-2407
mistralai/Mistral-Nemo-Instruct-2407-FP8
Hardware Support
- Only supports DigitalOcean GPU Droplets with Nvidia GPUs
- Supports both single-GPU and multi-GPU configurations
HuggingFace Token
- A Hugging Face token with Read access is required for downloading the models. Note that some models require you to request access on the Hugging Face website before they can be downloaded.
- Hugging Face tokens are stored in the /.cache/huggingface directory, in the stored_tokens and token files. Please remove them before sharing your Droplet with others; see the example below.
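A quick way to clear the cached credentials (paths as listed above) so a different token can be entered the next time run_model.sh runs:

# remove the stored Hugging Face tokens from the shared cache directory
rm -f /.cache/huggingface/stored_tokens /.cache/huggingface/token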
Software Included
Package | Version | License
---|---|---
Ubuntu | 24.04 LTS | GPLv3 |
CUDA Toolkit | 12.9 | NVIDIA EULA |
NVIDIA Driver | 575.51.03 | NVIDIA EULA |
vLLM (vllm-openai container) | 0.9.0 | Apache 2.0 |
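To sanity-check these components on a running Droplet, you can query them directly:

# driver and CUDA version reported by the NVIDIA driver
nvidia-smi
# container runtime used to run the vLLM servers
docker --version
# once run_model.sh has started the models, the vllm containers appear here
docker ps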
Creating an App using the Control Panel
Click the Deploy to DigitalOcean button to create a Droplet based on this 1-Click App. If you aren’t logged in, this link will prompt you to log in with your DigitalOcean account.
Creating an App using the API
In addition to creating a Droplet from the 1-Click Inference Ready - Single GPU 1-Click App using the control panel, you can also use the DigitalOcean API. As an example, to create a 4GB 1-Click Inference Ready - Single GPU Droplet in the SFO2 region, you can use the following curl command. You need to either save your API access token to an environment variable or substitute it into the command below.
curl -X POST -H 'Content-Type: application/json' \
-H "Authorization: Bearer $TOKEN" -d \
'{"name":"choose_a_name","region":"sfo2","size":"s-2vcpu-4gb","image": "digitaloceanai-1clickinferencer"}' \
"https://api.digitalocean.com/v2/droplets"
Getting Started After Deploying 1-Click Inference Ready - Single GPU
Getting started after deploying the Inference Optimized image
- Access the Droplet Console
  - Navigate to the GPU Droplets page.
  - Locate your newly created Droplet and click on its name.
  - At the top of your screen, select and launch the Web Console.
  - Note: For H100x8, it may take an additional 5 minutes for the machine to become fully operational after it has been activated.
- Login via SSH
  - If you selected an SSH key during Droplet creation:
    - Open your preferred SSH client (e.g., PuTTY, Terminal).
    - Use the Droplet’s public IP address to log in as root:
      ssh root@your_droplet_public_IP
    - Ensure your SSH key is added to the SSH agent (see the sketch after this step), or specify the key file directly:
      ssh -i /path/to/your/private_key root@your_droplet_public_IP
  - Once connected, you will be logged in as the root user without needing a password.
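A minimal sketch of adding a key to the SSH agent on your local machine, assuming a standard OpenSSH client and using the example key path from above:

# start the agent for the current shell and load your private key
eval "$(ssh-agent -s)"
ssh-add /path/to/your/private_key
# then connect without passing the key file explicitly
ssh root@your_droplet_public_IP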
- Execute the Inference Optimized Script
  - Run the following command:
    bash run_model.sh
  - If it’s your first time, input your Hugging Face token from https://huggingface.co/settings/tokens.
  - You’ll be presented with the following options (for H100x8):
    - Option 1: Single model (8 GPUs, TP=8)
    - Option 2: Two models concurrently (4 GPUs each, TP=4)
    - Option 3: Four models concurrently (2 GPUs each, TP=2)
  - For non-H100x8 Droplets, the available models depend on the GPU type.
  - Select a configuration option and follow the prompts to choose models.
If Option 1 is selected:
Select a model to download and run:
[0] meta-llama/Llama-3.1-8B-Instruct
[1] meta-llama/Llama-3.3-70B-Instruct
[2] meta-llama/Llama-4-Scout-17B-16E-Instruct
[3] deepseek-ai/DeepSeek-R1-Distill-Llama-70B
[C] Enter custom model name
If Option 2 is selected:
Select TWO models to run concurrently (4 GPUs each):
[0] meta-llama/Llama-3.1-8B-Instruct
[1] meta-llama/Llama-3.3-70B-Instruct
[2] deepseek-ai/DeepSeek-R1-Distill-Llama-70B
If Option 3 is selected:
Select FOUR models to run concurrently (2 GPUs each):
Note: You can select the same model multiple times for data parallelism!
[0] meta-llama/Llama-3.1-8B-Instruct
[1] meta-llama/Llama-3.3-70B-Instruct-FP8
[2] deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
[3] meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding
For single GPU (example: RTX 6000 ADA), you will see:
[INFO] Detected GPU: NVIDIA RTX 6000 Ada Generation
[INFO] GPU Type: RTX6000ADA
[INFO] RTX6000ADA detected - All models are available
Select a model to download and run:
[0] meta-llama/Llama-3.1-8B
[1] meta-llama/Llama-3.1-8B-FP8
[2] mistralai/Mistral-Nemo-Instruct-2407
[3] mistralai/Mistral-Nemo-Instruct-2407-FP8
[C] Enter custom model name
Note: For Options 2 and 3, you can select the same model multiple times to achieve data parallelism.
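If you do deploy the same model in two slots, clients can spread requests across the corresponding ports themselves. A hypothetical sketch, assuming identical models are served on ports 8000 and 8001:

# client-side round-robin across two identical deployments
PORTS=(8000 8001)
for i in 0 1 2 3; do
  PORT=${PORTS[$((i % 2))]}
  curl -s "http://localhost:$PORT/v1/models"
  echo
done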
Sample Option 3 Walkthrough
Choose concurrent model configuration:
[1] Single model (8 GPUs, tensor-parallel-size=8)
[2] Two models concurrent (4 GPUs each, tensor-parallel-size=4)
[3] Four models concurrent (2 GPUs each, tensor-parallel-size=2)
Enter your choice (1-3): 3
Select FOUR models to run concurrently (2 GPUs each):
Enter model choice for slot 1 (0-3): 0
Slot 1: meta-llama/Llama-3.1-8B-Instruct
Enter model choice for slot 2 (0-3): 1
Slot 2: meta-llama/Llama-3.3-70B-Instruct-FP8
Enter model choice for slot 3 (0-3): 2
Slot 3: deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
Enter model choice for slot 4 (0-3): 3
Slot 4: meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding
- Check Deployment Status
A summary appears as models are downloaded and vLLM containers are created:
Waiting for all model APIs to become ready. It is expected to take a few minutes.
[meta-llama/Llama-3.1-8B-Instruct on port 8000] Waiting for API to be ready... Retry in 10 seconds.
...
Deployment on port 8000 is ✅ Ready!
When deployment is ready, you’ll see:
Running Models and Ports
Model 1: meta-llama/Llama-3.1-8B-Instruct
Container: vllm
Port: 8000
CUDA Devices: 0,1
API Endpoint: http://localhost:8000/v1
---
Model 2: meta-llama/Llama-3.3-70B-Instruct-FP8
Container: vllm2
Port: 8001
CUDA Devices: 2,3
API Endpoint: http://localhost:8001/v1
---
Model 3: deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
Container: vllm3
Port: 8002
CUDA Devices: 4,5
API Endpoint: http://localhost:8002/v1
---
Model 4: meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding
Container: vllm4
Port: 8003
CUDA Devices: 6,7
API Endpoint: http://localhost:8003/v1
View logs:
docker exec vllm tail -n 100 /var/log/vllm.log
docker exec vllm2 tail -n 100 /var/log/vllm.log
docker exec vllm3 tail -n 100 /var/log/vllm.log
docker exec vllm4 tail -n 100 /var/log/vllm.log
Test APIs:
curl http://localhost:8000/v1/models
curl http://localhost:8001/v1/models
curl http://localhost:8002/v1/models
curl http://localhost:8003/v1/models
At this point, the models are now running, and you can verify functionality via logs or API calls.
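For example, to send a test chat request to the first model through vLLM’s OpenAI-compatible API (model name and port taken from the summary above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'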
- Troubleshooting
- If the containers are running, check their logs:
docker exec vllm tail -n 100 /var/log/vllm.log
docker exec vllm2 tail -n 100 /var/log/vllm.log
docker exec vllm3 tail -n 100 /var/log/vllm.log
docker exec vllm4 tail -n 100 /var/log/vllm.log
- NVIDIA driver upgrades are held to avoid version mismatches. However, Ubuntu unattended upgrades may still cause CUDA errors.
- If errors occur, reboot the Droplet and re-run the script once it is back up:
  sudo reboot
  bash run_model.sh
- View the unattended upgrade logs at /var/log/apt/history.log
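To check whether unattended upgrades touched the GPU stack, you can, for example, inspect the apt history and the driver state:

# look for recent NVIDIA/CUDA package changes recorded by apt
grep -iE 'nvidia|cuda' /var/log/apt/history.log
# confirm the driver is still loaded and reports the expected version
nvidia-smi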
- If you don’t have access to a Hugging Face model, you’ll see:
  Access to model mistralai/Mistral-Nemo-Instruct-2407 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 to ask for access.
- For Option 1 with custom models, ensure they fit into H100x8 GPU memory. Most models fit, except very large ones (e.g., deepseek-ai/DeepSeek-R1).
- Hugging Face cache info:
  - The cache is located in the /.cache/huggingface directory
  - Models are stored under hub/
  - Tokens are stored in the stored_tokens and/or token files
  - Delete these files if you wish to switch Hugging Face tokens
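For example, to reclaim disk space by deleting one cached model; the models--{org}--{name} directory layout is the standard Hugging Face hub cache convention, and the model shown is just an example:

# list the cached models
ls /.cache/huggingface/hub/
# remove a single cached model (example path)
rm -rf /.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct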