NVIDIA H100 Tensor Core GPU Reference

Validated on 14 Dec 2023 • Last edited on 17 May 2024 gpu | cuda | vcpu | core machine | bare metal

Machines are high-performance computing resources for scaling AI applications.

The NVIDIA H100 Tensor Core GPU is built on NVIDIA’s Hopper GPU architecture and is suited to large-scale AI and HPC workloads.

To learn more about the NVIDIA H100, read NVIDIA’s Hopper Architecture product outline.

Paperspace now supports the NVIDIA H100 both with a single chip (NVIDIA H100x1) and with eight chips (NVIDIA H100x8), currently in the NYC2 datacenter.

Note

You can only use NVIDIA H100s once Paperspace has approved your request for them.

Machine Types

Here are the machine details for NVIDIA H100.

Name	GPU Memory (GB)	vCPUs	CPU RAM (GB)	NVLink Support	GPU Interconnect Speeds
NVIDIA H100x1	80 GB	20	250 GB	No	N/A
NVIDIA H100x8	640 GB	128	2048 GB	Yes	3.2 Tb/s

For information about NVLink, see NVIDIA’s NVLink documentation.

Getting Started

NVIDIA H100s are available as on-demand compute. Once Paperspace has approved your request, your NVIDIA H100s are immediately accessible whenever capacity is available.

A VPC private network is required to start any NVIDIA H100 machine. The private network also allows for multi-node training if you need it. If your work requires nodes to see a common file system, then you need to provide access to shared drives.

Template and Libraries

ML-in-a-Box 22.04 Template

Once you have access to your NVIDIA H100 GPU, follow the Deep Learning with ML in a Box tutorial to learn how to access a generic data science stack. When using the ML-in-a-Box template, you do not need to disable NVLink for H100x1 machines.

Warning

Not all libraries and versions work with NVIDIA H100s. If you change your CUDA version or add/remove libraries that differ from the ML-in-a-Box template, this may cause your NVIDIA H100s to not work correctly. See Software Included for the current versions used within ML-in-a-Box.

Ubuntu 20.04 works on all GPUs except H100s. A100-80Gs work with both Ubuntu 20.04 and 22.04, while H100s only work with Ubuntu 22.04.

Ubuntu 22.04 Base Image

For H100x1, you are required to install and properly configure NVIDIA drivers and CUDA without fabric-manager, which is the management software for NVLink. H100x1 requires you to disable NVLink both at the system level and in the RAM disk (initrd) that the system uses to boot. This ensures that CUDA starts. You can follow this guide to disable NVLink in order to successfully run CUDA with NVIDIA H100x1 machines on an Ubuntu 22.04 base image.

For H100x8, you are required to install and properly configure NVIDIA drivers and CUDA with fabric-manager. H100x8 does not require you to disable NVLink with Ubuntu.

Performance Specs

The following table displays the performance specifications for NVIDIA H100. We’ve added the NVIDIA A100-80G for reference.

Name	Generation	Type	FP32 CUDA Cores	GPU Memory	Memory Bandwidth	FP64 Tensor Core or FP32	TF32 Tensor Core	BFLOAT16 or FP16 Tensor Core	FP8 Tensor Core or INT8 Tensor Core
NVIDIA H100x1¹	Hopper	SXM5	16,896	80 GB HBM3	3.35 TB/s	67 TFLOPS	989 TFLOPS	1979 TFLOPS	3958 TFLOPS/TOPS
NVIDIA A100-80Gx1	Ampere	SXM4	6,912	80 GB HBM2	1.555 TB/s	19.5 TFLOPS	312 TFLOPS	624 TFLOPS	N/A / 1248 TOPS

To learn more about how the NVIDIA H100 GPU compares to other machine types, read our machine types and their performance specs.

Note

NVIDIA H100x8 machines have 3.2 Tb/s speeds with NVLink interconnects between GPUs. This works without any additional setup when using multiple machines on a private network. There are no interconnects and no NVLink when you are using NVIDIA H100x1.

Common Commands

After you have created and connected to your NVIDIA H100 GPU, you can use these commands to verify the state of your GPUs, such as whether they are accessible within the environment or whether NVLink support is activated.

nvidia-smi: Checks if the GPUs are present. Use the PyTorch command python -m torch.utils.collect_env to get additional information about the environment.
nvidia-smi topo -m: Checks if NVLink is available. If it is available, it outputs NV18 between all GPUs on NVIDIA H100.
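
The NVLink check can also be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH of the machine. The helper names nvlink_labels and read_topology are hypothetical and not part of any NVIDIA or Paperspace tooling.

```python
import re
import subprocess


def nvlink_labels(topo_output: str) -> set[str]:
    """Return the distinct NVLink labels (for example NV18) found in the topology text."""
    # Matches NV followed by digits; the legend entry "NV#" and the word "NVLink" do not match.
    return set(re.findall(r"\bNV\d+\b", topo_output))


def read_topology() -> str:
    """Run `nvidia-smi topo -m` and return its text output (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a machine with GPUs, nvlink_labels(read_topology()) returns an empty set when no NVLink connections are reported, which is what you would expect on H100x1.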

Best Practices

General Usage

After moving 10 GBs or more of data or models, we recommend using checksums (such as md5sum) to verify that the move was successful. Instead of using the mv command to move your files, we also recommend using the cp command to copy the files to the new location first and then use the rm command to delete the old files after you’ve run a checksum on the copied files. This maintains an intact copy of the data that you can use in the event that the copied data was corrupted in the copy process.
You may need to bind-mount shared drives and enable ports when using containers. Port publishing works better with --network=host than --publish.
To free up space, you can delete cached data, such as cached Docker images or the cached models created when fine-tuning large language models (for example, ~/.cache/huggingface).
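
The copy, checksum, then delete recommendation can be sketched in Python as follows. This is an illustration under stated assumptions, not a Paperspace utility: the function names md5_of and verified_move are hypothetical, and the sketch handles single files only.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_move(src: Path, dst: Path) -> None:
    """Copy src to dst, compare checksums, and delete src only if they match."""
    shutil.copy2(src, dst)
    if md5_of(src) != md5_of(dst):
        # The original is left untouched so that the copy can be retried.
        raise OSError(f"checksum mismatch after copying {src} to {dst}; source kept")
    src.unlink()
```

Because the source file is deleted only after the checksums match, a corrupted copy never leaves you without an intact original.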

Model Training

We recommend using tmux while training as it is a tool that allows you to access multiple terminal sessions at once. Since training can result in long computations, tmux is useful as it does not terminate a run if a network error breaks your connection to the terminal, such as if you receive a client_loop: send disconnect: Broken pipe error.

Environments

On macOS, you can access tmux with iTerm2, a terminal replacement with tmux support. Once set up, run the command tmux -CC. After starting your run, press Esc in the original window to detach your tmux session. Later, you can reopen it with tmux -CC attach. When the process is done, close the window.
The full disk size is not the same as the amount of storage available, which is the disk size minus the overhead of the virtual machine template. For example, with ML-in-a-Box, the template uses 28 GB of space, which leaves roughly 65 GB of available storage on a 97 GB disk. To see your disk size and available storage, run the command df -h. This returns an output like the following:

Filesystem                   Size  Used Avail Use% Mounted on
tmpfs                         25G  1.5M   25G   1% /run
/dev/mapper/ubuntu--vg-root   97G   28G   65G  31% /
tmpfs                        123G     0  123G   0% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
/dev/xvda2                   1.7G  232M  1.4G  15% /boot
tmpfs                         25G   92K   25G   1% /run/user/1000

The /dev/mapper/ubuntu--vg-root line shows you how much disk space your machine has available.

If you are using a virtual environment such as Python venv, start tmux before activating the environment.
The default Python is 3.11.x, accessed via python or python3.
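
If you prefer to check storage from Python rather than with df -h, the standard library offers shutil.disk_usage. The following is a minimal sketch; the function name storage_summary is hypothetical. The values are reported in GiB, and the free figure reflects space available to the current user, so it can be smaller than total minus used.

```python
import shutil


def storage_summary(path: str = "/") -> dict[str, float]:
    """Return total, used, and free space in GiB for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }
```

Calling storage_summary("/") on the machine reports on the root filesystem, which corresponds to the /dev/mapper/ubuntu--vg-root line in the df -h output.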

Note

H100 machines use 12.2 as the CUDA driver version and 12.1 as the CUDA runtime version.

nvidia-smi
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
...

nvcc --version
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
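
To compare the two versions programmatically, you can extract them from the command output. The following is a minimal sketch that assumes the output has the same shape as the examples shown here; the function names are hypothetical.

```python
import re
from typing import Optional


def cuda_driver_version(smi_output: str) -> Optional[str]:
    """Extract the 'CUDA Version' field from the nvidia-smi header, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def cuda_runtime_version(nvcc_output: str) -> Optional[str]:
    """Extract the release number from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None
```

Both functions take the text captured from the corresponding command, so they can be tested without a GPU.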

Multinode

For machines created after 17 January 2024, the template includes an NCCL configuration file, /etc/nccl.conf, to enable optimal running on the new H100 GPU fabric. The file contains the following lines:

NCCL_TOPO_FILE=/etc/nccl/topo.xml
NCCL_IB_DISABLE=0
NCCL_IB_CUDA_SUPPORT=1
NCCL_IB_HCA=mlx5
NCCL_CROSS_NIC=0
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_GID_INDEX=1

For machines created before 17 January 2024, you need to run the following command to create the /etc/nccl.conf file:

sudo bash -c 'cat << EOF > /etc/nccl.conf
NCCL_TOPO_FILE=/etc/nccl/topo.xml
NCCL_IB_DISABLE=0
NCCL_IB_CUDA_SUPPORT=1
NCCL_IB_HCA=mlx5
NCCL_CROSS_NIC=0
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_GID_INDEX=1
EOF'

This enables the same optimal running for multinode that machines created after 17 January 2024 have by default.
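
To confirm that an existing /etc/nccl.conf matches the settings listed above, a small check such as the following could be used. This is a sketch, not an official tool: the function names are hypothetical, and treating lines that start with # as comments is an assumption about the file format.

```python
EXPECTED_NCCL_SETTINGS = {
    "NCCL_TOPO_FILE": "/etc/nccl/topo.xml",
    "NCCL_IB_DISABLE": "0",
    "NCCL_IB_CUDA_SUPPORT": "1",
    "NCCL_IB_HCA": "mlx5",
    "NCCL_CROSS_NIC": "0",
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_GID_INDEX": "1",
}


def parse_nccl_conf(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and lines starting with # (assumed comments)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def settings_to_fix(text: str) -> dict[str, str]:
    """Return the expected settings that are missing or have a different value in `text`."""
    found = parse_nccl_conf(text)
    return {k: v for k, v in EXPECTED_NCCL_SETTINGS.items() if found.get(k) != v}
```

For example, settings_to_fix(open("/etc/nccl.conf").read()) returns an empty dictionary when the file already contains all seven settings.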

When using containers for multinode, pass the InfiniBand device and volumes into the container. Adding the following arguments to a docker run command gives the container access to the InfiniBand hardware in the upgraded infrastructure and to the NCCL configuration, which is needed for the best multinode performance on your H100s.

--device /dev/infiniband/:/dev/infiniband/
--volume /dev/infiniband/:/dev/infiniband/
--volume /sys/class/infiniband/:/sys/class/infiniband/
--volume /etc/nccl/:/etc/nccl/
--volume /etc/nccl.conf:/etc/nccl.conf:ro
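
If you launch containers from a script, the arguments above can be assembled into a full command. The following is a minimal sketch: the function name docker_run_command is hypothetical, and the image name and training command in the usage note are placeholders rather than real Paperspace artifacts.

```python
import shlex

# The device and volume arguments listed above, as an argument list.
INFINIBAND_ARGS = [
    "--device", "/dev/infiniband/:/dev/infiniband/",
    "--volume", "/dev/infiniband/:/dev/infiniband/",
    "--volume", "/sys/class/infiniband/:/sys/class/infiniband/",
    "--volume", "/etc/nccl/:/etc/nccl/",
    "--volume", "/etc/nccl.conf:/etc/nccl.conf:ro",
]


def docker_run_command(image: str, command: list[str], extra_args: tuple[str, ...] = ()) -> list[str]:
    """Build a `docker run` argument list that includes the InfiniBand and NCCL mounts."""
    return ["docker", "run", *extra_args, *INFINIBAND_ARGS, image, *command]


def as_shell(args: list[str]) -> str:
    """Render the argument list as a single shell-quoted string for logging or copy-paste."""
    return shlex.join(args)
```

For example, as_shell(docker_run_command("my-training-image", ["python", "train.py"])) produces a command line you can inspect before running it. The resulting list can also be passed directly to subprocess.run.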

Note

To access NVIDIA’s GPU-optimized models, SDKs, and containers for optimal performance, visit the NGC Catalog.

Limitations

The following features are not currently available:
- NUMA mapping is not implemented. CPU affinities are outputted as N/A in nvidia-smi topo -m.
- Shared drives do not support symbolic links. A shared drive that is created in the console’s Drives tab, visible from a virtual machine, and bind-mounted to a container does not support symlinks:

>>> ln -s test /mnt/my_shared_drive/test
ln: failed to create symbolic link '/mnt/my_shared_drive/test': Operation not supported
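
Scripts that create symlinks can work around this limitation by falling back to a copy when the target is on a shared drive. The following is a minimal sketch for single files, assuming absolute paths; the function name link_or_copy is hypothetical, and the set of error codes treated as "not supported" is an assumption.

```python
import errno
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Try to create a symlink at dst pointing to src; copy the file if symlinks are refused.

    Returns "symlink" or "copy" to report which path was taken.
    """
    try:
        os.symlink(src, dst)
        return "symlink"
    except OSError as error:
        # "Operation not supported" maps to ENOTSUP/EOPNOTSUPP; EPERM is included as a guess.
        if error.errno not in (errno.ENOTSUP, errno.EOPNOTSUPP, errno.EPERM):
            raise
        shutil.copy2(src, dst)
        return "copy"
```

A copy uses extra space and does not track later changes to the original, so this fallback suits read-only inputs better than files that are updated in place.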

¹ NVIDIA H100 Tensor Core specifications for the TF32, BFLOAT16, FP16, FP8, and INT8 data types are given with sparsity, meaning that the data contains matrices with mostly zeros. For more information on NVIDIA H100 tensor cores, visit NVIDIA’s Tensor Core GPU data sheet.
