Llama 3.1 405B Instruct FP8 - Multi GPU

Last edited on 22 Oct 2024 • Generated on 13 Nov 2024

This page is automatically generated from the DigitalOcean Marketplace using content on the Llama 3.1 405B Instruct FP8 - Multi GPU catalog page.

Meta-Llama-3.1-405B-Instruct-FP8 is a 405-billion-parameter multilingual large language model optimized for dialogue use cases. It was trained on a diverse mix of publicly available online data and fine-tuned for safety and helpfulness.

Model ID: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Supported Language(s): en, de, fr, it, pt, hi, es, th

License: Llama3.1

Modality: text

Hardware Support

GPU Model	Number of accelerators	Max Input Tokens	Max New Tokens
NVIDIA H100	8	20928	20960

Software Included

Package	Version	License
Meta Llama 3.1	3.1-405B-Instruct-FP8	Llama3.1

Creating an App using the Control Panel

Click the Deploy to DigitalOcean button to create a Droplet based on this 1-Click App. If you aren’t logged in, this link will prompt you to log in with your DigitalOcean account.

Creating an App using the API

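In addition to creating a Droplet from the Llama 3.1 405B Instruct FP8 - Multi GPU 1-Click App using the control panel, you can also use the DigitalOcean API. As an example, the following curl command creates a Droplet from this 1-Click App in the SFO2 region. You need to either save your API access token to an environment variable (TOKEN in the command below) or substitute it in the command. Note that the size slug in this example (s-2vcpu-4gb) is a generic 4GB value; the Hardware Support table above lists 8 NVIDIA H100 accelerators for this model, so substitute a GPU Droplet size and a region where that size is available.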

curl -X POST -H 'Content-Type: application/json' \
     -H "Authorization: Bearer $TOKEN" \
     -d '{"name":"choose_a_name","region":"sfo2","size":"s-2vcpu-4gb","image":"digitaloceanai-llama31405binstructfp8"}' \
     "https://api.digitalocean.com/v2/droplets"

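The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the function names `build_droplet_request` and `send` are hypothetical, and the payload values are the ones from the curl command above.

```python
import json
import urllib.request

API_URL = "https://api.digitalocean.com/v2/droplets"


def build_droplet_request(token, name, region, size, image):
    """Build (but do not send) the POST request that creates a Droplet."""
    payload = {"name": name, "region": region, "size": size, "image": image}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To create the Droplet, read the token from the TOKEN environment variable (for example with `os.environ["TOKEN"]`), pass it to `build_droplet_request` together with the name, region, size, and image values, and hand the result to `send`. Keeping request construction separate from sending makes it easy to inspect the payload before anything is created.
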
Getting Started After Deploying Llama 3.1 405B Instruct FP8 - Multi GPU

Quickly Get Started With Your 1-Click Models

Access the Droplet Console:
- Navigate to the GPU Droplets page.
- Locate your newly created 1-Click Model Droplet and click on its name.
- Under the “Access” tab, select Console. This opens an in-browser terminal session connected to your Droplet.
- Log in as the root user using the password you set during Droplet creation.

Login via SSH:

If you selected an SSH key during Droplet creation, follow these steps:
- Open your preferred SSH client (e.g., PuTTY, Terminal).
- Use the Droplet’s public IP address to log in as root:

ssh root@your_droplet_public_IP

- Ensure your SSH key is added to the SSH agent, or specify the key file directly:

ssh -i /path/to/your/private_key root@your_droplet_public_IP

- Once connected, you will be logged in as the root user without needing a password.

Check the Message of the Day (MOTD) for Access Token:
- Upon successful login via console or SSH, the Message of the Day (MOTD) will be displayed.
- This message includes important information such as the bearer token. Take note of this token as you’ll need it to use the inference API for your model.

Troubleshooting

Note that the model takes a couple of minutes to load while its Docker container starts. During this time, any API calls to the model will time out.
To check that Caddy is running, run:

sudo systemctl status caddy
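
Because calls time out while the model is loading, a client may want to retry until the endpoint responds. The following is a minimal polling sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, and the default timeout and interval are illustrative guesses, not documented values.

```python
import time


def wait_until_ready(probe, timeout_s=600.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses.

    probe is any zero-argument callable. Exceptions it raises (for example
    a timeout while the model is still loading) count as "not ready yet".
    Returns True once the probe succeeds, False if the deadline passes.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # endpoint not answering yet; retry after a pause
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A suitable probe could send a small chat completion request to the inference API and return True when it gets a successful response.
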

Usage Examples

Using cURL

You can make a local API call using this cURL command:

curl -X 'POST' \
  'http://<your_droplet_ip>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your_token_here>' \
  -d '{
    "model": "<model_name>",
    "messages": [{"role":"user", "content":"What is Deep Learning?"}],
    "max_tokens": 64,
    "stream": false
}'
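
The cURL call above can also be reproduced with Python's standard library. This is a sketch that assumes the endpoint returns an OpenAI-style chat completion body (with the reply under `choices[0].message.content`); the function names are hypothetical, and the IP address, token, and model name are placeholders you supply.

```python
import json
import urllib.request


def build_chat_request(droplet_ip, token, model, prompt, max_tokens=64):
    """Build the same non-streaming POST request as the cURL command."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"http://{droplet_ip}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_reply(response_json):
    """Return the assistant text from an OpenAI-style chat completion body."""
    return response_json["choices"][0]["message"]["content"]


def chat(droplet_ip, token, model, prompt, timeout_s=120):
    """Send the request and return the reply text."""
    request = build_chat_request(droplet_ip, token, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_reply(json.load(response))
```

Calling `chat` with your Droplet IP, the bearer token from the MOTD, the model name, and a prompt such as "What is Deep Learning?" should return the generated text as a string.
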

Using Python with `huggingface_hub`

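from huggingface_hub import InferenceClient

# Local address and placeholder key from the original example;
# adjust them for your deployment.
client = InferenceClient(
    base_url="http://0.0.0.0:8080/v1",
    api_key="-",
)

# Request a streamed chat completion using the Model ID listed on this page.
output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

# Print each fragment as it arrives; content can be None on some chunks.
for chunk in output:
    print(chunk.choices[0].delta.content or "", end="")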

Using Python with OpenAI library

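from openai import OpenAI

# Local address and placeholder key from the original example;
# adjust them for your deployment.
client = OpenAI(
    api_key="-",
    base_url="http://0.0.0.0:8080/v1",
)

# Request a streamed chat completion using the Model ID listed on this page.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
    max_tokens=64,
)

# Iterate and print the stream; content can be None on some chunks.
for message in response:
    print(message.choices[0].delta.content or "", end="")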

Because the endpoint follows the OpenAI API format, the same approach should work with other OpenAI client libraries, including the JavaScript one.
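
For clients without an OpenAI library, a streamed response can be parsed by hand. The sketch below assumes the endpoint streams in the same server-sent-events format as the OpenAI API (`data: {json}` lines, optionally ending with `data: [DONE]`), which this page does not state explicitly; `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

The function accepts any iterable of lines, so it can be fed an HTTP response object opened with `"stream": true` in the request body, or a list of captured lines when testing.
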