1-Click Inference Ready - Single GPU

Generated on 22 Jul 2025 from the 1-Click Inference Ready - Single GPU catalog page

The Inference Optimized image simplifies the setup and deployment of large language models (LLMs) by leveraging Docker and vLLM, with built-in support for Hugging Face model downloads, speculative decoding, prompt caching, and multi-model concurrency.

Users can configure the system to run one, two, or four models concurrently, each with customizable tensor parallelism settings to optimize hardware utilization. If a model isn’t already cached locally, it is automatically downloaded and stored for future use.

The image also includes special handling for FP8 quantization, enabling efficient low-precision inference. Speculative decoding is fully supported, including the use of draft models to enhance performance.

This makes it a powerful, plug-and-play solution for scalable and optimized LLM inference.
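
As a rough illustration of how these pieces fit together, the sketch below shows how a single model could be served with the vllm-openai container using tensor parallelism across 8 GPUs. This is a hypothetical example, not the exact command issued by the image's run_model.sh; the container name, cache mount, tag, and HF_TOKEN variable are assumptions for illustration only.

# Minimal sketch (assumed flags and paths): serve one model across 8 GPUs
docker run -d --gpus all --name vllm \
  -v /.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 8

In practice you do not need to run this yourself; the bundled run_model.sh script assembles the equivalent container commands based on the configuration you select.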

Models Available (H100x8)

Single Model Mode (8 GPUs)

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.3-70B-Instruct
  • meta-llama/Llama-4-Scout-17B-16E-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  • Any custom model available on Hugging Face (manual specification)

Two Models Concurrent Mode (4 GPUs Each)

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.3-70B-Instruct
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B

Four Models Concurrent Mode (2 GPUs Each)

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.3-70B-Instruct-FP8
  • deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
  • meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding

⚠️ Licenses: Please check the respective model licenses on Hugging Face.

Many models require license acceptance before running the automation script.


Models Available (Other Nvidia GPUs)

RTX 4000 ADA

  • meta-llama/Llama-3.1-8B-FP8

H100, L40S, or RTX6000 ADA

  • meta-llama/Llama-3.1-8B
  • meta-llama/Llama-3.1-8B-FP8
  • mistralai/Mistral-Nemo-Instruct-2407
  • mistralai/Mistral-Nemo-Instruct-2407-FP8

Hardware Support

  • Only supports DigitalOcean GPU Droplets with Nvidia GPUs
  • Supports both single-GPU and multi-GPU configurations

HuggingFace Token

  • A Hugging Face token with Read access is required for downloading the models. Note that some models require you to request access on the Hugging Face website before they can be downloaded.
  • The Hugging Face tokens are stored in the /.cache/huggingface directory, in the stored_tokens and token files. Please remove them prior to sharing your Droplet with others.
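
For example, to clear the stored credentials before handing the Droplet to someone else (using the paths listed above):

# Remove stored Hugging Face credentials before sharing the Droplet
rm -f /.cache/huggingface/stored_tokens /.cache/huggingface/token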

Software Included

Package                        Version      License
Ubuntu                         24.04 LTS    GPLv3
CUDA Toolkit                   12.9         NVIDIA EULA
NVIDIA Driver                  575.51.03    NVIDIA EULA
vLLM (vllm-openai container)   0.9.0        Apache 2.0
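
To sanity-check the installed stack against this table, standard commands such as the following can be used (nvcc reports the toolkit version only if the CUDA binaries are on your PATH):

nvidia-smi          # NVIDIA driver version and visible GPUs
nvcc --version      # CUDA toolkit version, if on PATH
docker --version    # Docker, used to run the vllm-openai container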

Creating an App using the Control Panel

Click the Deploy to DigitalOcean button to create a Droplet based on this 1-Click App. If you aren’t logged in, this link will prompt you to log in with your DigitalOcean account.

Deploy to DO

Creating an App using the API

In addition to creating a Droplet from the 1-Click Inference Ready - Single GPU 1-Click App using the control panel, you can also use the DigitalOcean API. As an example, to create a 4GB 1-Click Inference Ready - Single GPU Droplet in the SFO2 region, you can use the following curl command. You need to either save your API access token to an environment variable or substitute it in the command below.

curl -X POST -H "Content-Type: application/json" \
     -H "Authorization: Bearer $TOKEN" -d \
     '{"name":"choose_a_name","region":"sfo2","size":"s-2vcpu-4gb","image":"digitaloceanai-1clickinferencer"}' \
     "https://api.digitalocean.com/v2/droplets"

Getting Started After Deploying 1-Click Inference Ready - Single GPU

Getting started after deploying the Inference Optimized image

  1. Access the Droplet Console

    1. Navigate to the GPU Droplets page.

    2. Locate your newly created Droplet and click on its name.

    3. At the top of your screen, select and launch the Web Console.

    Note: For H100x8, it may take an additional 5 minutes for the machine to become fully operational after it has been activated.

  2. Login via SSH


    1. If you selected an SSH key during droplet creation:

      • Open your preferred SSH client (e.g., PuTTY, Terminal).

    2. Use the droplet’s public IP address to log in as root:

ssh root@your_droplet_public_IP

    3. Ensure your SSH key is added to the SSH agent (see the example after this list), or specify the key file directly:

ssh -i /path/to/your/private_key root@your_droplet_public_IP

    4. Once connected, you will be logged in as the root user without needing a password.
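
If your key is not yet loaded into the SSH agent (step 3 above), a typical way to add it looks like the following; the key path is a placeholder:

# Start the agent if it is not already running, then add your key
eval "$(ssh-agent -s)"
ssh-add /path/to/your/private_key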

  3. Execute the Inference Optimized Script


    1. Run the following command:

bash run_model.sh

    2. If it’s your first time, input your Hugging Face token from https://huggingface.co/settings/tokens.

    3. You’ll be presented with the following options (for H100x8):

      • Option 1: Single model (8 GPUs, TP=8)
      • Option 2: Two models concurrently (4 GPUs each, TP=4)
      • Option 3: Four models concurrently (2 GPUs each, TP=2)

    4. For non-H100x8, available models will depend on GPU type.

    5. Select a configuration option and follow the prompts to choose models.

If Option 1 is selected:

Select a model to download and run:
 [0] meta-llama/Llama-3.1-8B-Instruct
 [1] meta-llama/Llama-3.3-70B-Instruct
 [2] meta-llama/Llama-4-Scout-17B-16E-Instruct
 [3] deepseek-ai/DeepSeek-R1-Distill-Llama-70B
 [C] Enter custom model name

If Option 2 is selected:

Select TWO models to run concurrently (4 GPUs each):
 [0] meta-llama/Llama-3.1-8B-Instruct
 [1] meta-llama/Llama-3.3-70B-Instruct
 [2] deepseek-ai/DeepSeek-R1-Distill-Llama-70B

If Option 3 is selected:

Select FOUR models to run concurrently (2 GPUs each):
Note: You can select the same model multiple times for data parallelism!
 [0] meta-llama/Llama-3.1-8B-Instruct
 [1] meta-llama/Llama-3.3-70B-Instruct-FP8
 [2] deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
 [3] meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding

For single GPU (example: RTX 6000 ADA), you will see:

[INFO] Detected GPU: NVIDIA RTX 6000 Ada Generation
[INFO] GPU Type: RTX6000ADA
[INFO] RTX6000ADA detected - All models are available

Select a model to download and run:
 [0] meta-llama/Llama-3.1-8B
 [1] meta-llama/Llama-3.1-8B-FP8
 [2] mistralai/Mistral-Nemo-Instruct-2407
 [3] mistralai/Mistral-Nemo-Instruct-2407-FP8
 [C] Enter custom model name

Note: For Options 2 and 3, you can select the same model multiple times to achieve data parallelism.

Sample Option 3 Walkthrough

Choose concurrent model configuration:
 [1] Single model (8 GPUs, tensor-parallel-size=8)
 [2] Two models concurrent (4 GPUs each, tensor-parallel-size=4)
 [3] Four models concurrent (2 GPUs each, tensor-parallel-size=2)

Enter your choice (1-3): 3

Select FOUR models to run concurrently (2 GPUs each):
Enter model choice for slot 1 (0-3): 0
  Slot 1: meta-llama/Llama-3.1-8B-Instruct
Enter model choice for slot 2 (0-3): 1
  Slot 2: meta-llama/Llama-3.3-70B-Instruct-FP8
Enter model choice for slot 3 (0-3): 2
  Slot 3: deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
Enter model choice for slot 4 (0-3): 3
  Slot 4: meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding

  4. Check Deployment Status

A summary appears as models are downloaded and vLLM containers are created:

Waiting for all model APIs to become ready. It is expected to take a few minutes.
  [meta-llama/Llama-3.1-8B-Instruct on port 8000] Waiting for API to be ready... Retry in 10 seconds.
...
Deployment on port 8000 is ✅ Ready!

When deployment is ready, you’ll see:

Running Models and Ports

Model 1: meta-llama/Llama-3.1-8B-Instruct
Container: vllm
Port: 8000
CUDA Devices: 0,1
API Endpoint: http://localhost:8000/v1
---
Model 2: meta-llama/Llama-3.3-70B-Instruct-FP8
Container: vllm2
Port: 8001
CUDA Devices: 2,3
API Endpoint: http://localhost:8001/v1
---
Model 3: deepseek-ai/DeepSeek-R1-Distill-Llama-70B-FP8
Container: vllm3
Port: 8002
CUDA Devices: 4,5
API Endpoint: http://localhost:8002/v1
---
Model 4: meta-llama/Llama-3.3-70B-Instruct-FP8-Speculative-Decoding
Container: vllm4
Port: 8003
CUDA Devices: 6,7
API Endpoint: http://localhost:8003/v1

View logs:

docker exec vllm tail -n 100 /var/log/vllm.log
docker exec vllm2 tail -n 100 /var/log/vllm.log
docker exec vllm3 tail -n 100 /var/log/vllm.log
docker exec vllm4 tail -n 100 /var/log/vllm.log

Test APIs:

curl http://localhost:8000/v1/models
curl http://localhost:8001/v1/models
curl http://localhost:8002/v1/models
curl http://localhost:8003/v1/models
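
Beyond listing models, you can also send an OpenAI-compatible chat completion request to any endpoint. The example below assumes the first model from the Option 3 walkthrough on port 8000; adjust the model name and port to match your deployment:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'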

At this point, the models are running. You can verify functionality via logs or API calls.


  5. Troubleshooting

    1. If the containers are running, check the logs:

docker exec vllm tail -n 100 /var/log/vllm.log
docker exec vllm2 tail -n 100 /var/log/vllm.log
docker exec vllm3 tail -n 100 /var/log/vllm.log
docker exec vllm4 tail -n 100 /var/log/vllm.log
    2. Nvidia driver upgrades are held to avoid version mismatches.

      • However, Ubuntu unattended upgrades may cause CUDA errors.
      • If errors occur, run:

        sudo reboot
        bash run_model.sh

      • View unattended upgrade logs at:

        /var/log/apt/history.log
    3. If you don’t have access to a Hugging Face model, you’ll see:

Access to model mistralai/Mistral-Nemo-Instruct-2407 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 to ask for access.

    4. For Option 1 with custom models, ensure they fit into H100x8 GPU memory. Most models fit, except very large ones (e.g., deepseek-ai/DeepSeek-R1).

    5. Hugging Face cache info:

        /.cache/huggingface

      • Models are stored under hub/
      • Tokens are stored in the stored_tokens and/or token files
      • Delete these if you wish to switch Hugging Face tokens
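
For instance, to see which models are cached and to clear the stored tokens when switching accounts (paths as listed above):

ls /.cache/huggingface/hub                                          # cached model snapshots
du -sh /.cache/huggingface/hub/*                                    # disk usage per cached model
rm -f /.cache/huggingface/stored_tokens /.cache/huggingface/token   # lets you enter a different token on the next run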
