
NVIDIA H100 Tensor Core GPU Reference

Machines are high-performance computing resources for scaling AI applications.


The NVIDIA H100 Tensor Core GPU is built on NVIDIA’s Hopper GPU architecture and is suited to large-scale AI and HPC workloads.

To learn more about the NVIDIA H100, read NVIDIA’s Hopper Architecture product outline.

Paperspace now supports the NVIDIA H100 both with a single chip (NVIDIA H100x1) and with eight chips (NVIDIA H100x8), currently in the NYC2 datacenter.

Note
You can only use NVIDIA H100s once Paperspace has approved your request for them.

Machine Types

Here are the machine details for NVIDIA H100.

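| Name          | GPU Memory (GB) | vCPUs | CPU RAM (GB) | NVLink Support | GPU Interconnect Speeds |
|---------------|-----------------|-------|--------------|----------------|-------------------------|
| NVIDIA H100x1 | 80              | 20    | 250          | No             | N/A                     |
| NVIDIA H100x8 | 640             | 128   | 1638         | Yes            | 3.2 Tb/s                |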

For information about NVLink, see NVIDIA’s NVLink documentation.

Getting Started

NVIDIA H100s are available as on-demand compute: once Paperspace has approved your request, you can start an NVIDIA H100 immediately whenever capacity is available.

You need a VPC private network to start any NVIDIA H100 machine. The private network also enables multi-node training if you need it. If your work requires nodes to see a common file system, you also need to give them access to shared drives.

ML-in-a-Box 22.04 Template and Libraries

Once you have access to your NVIDIA H100 GPU, follow our tutorial, Deep Learning with ML in a Box, to learn how to access a generic data science stack.

Warning

Not all libraries and versions work with NVIDIA H100s. If you change your CUDA version or add/remove libraries that differ from the ML-in-a-Box template, this may cause your NVIDIA H100s to not work correctly. See Software Included for the current versions used within ML-in-a-Box.

Ubuntu 20.04 works on all GPUs except H100s. A100-80Gs work with both Ubuntu 20.04 and 22.04, while H100s work only with Ubuntu 22.04.

Performance Specs

The following table displays the performance specifications for the NVIDIA H100. We’ve added the NVIDIA A100-80G for reference.

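| Name              | Generation | Type | FP32 CUDA Cores | GPU Memory | Memory Bandwidth | FP64 Tensor Core or FP32 | TF32 Tensor Core | BFLOAT16 or FP16 Tensor Core | FP8 Tensor Core or INT8 Tensor Core |
|-------------------|------------|------|-----------------|------------|------------------|--------------------------|------------------|------------------------------|-------------------------------------|
| NVIDIA H100x1 [1] | Hopper     | SXM5 | 16,896          | 80 GB HBM3 | 3.35 TB/s        | 67 TFLOPS                | 989 TFLOPS       | 1979 TFLOPS                  | 3958 TFLOPS/TOPS                    |
| NVIDIA A100-80Gx1 | Ampere     | SXM4 | 6,912           | 80 GB HBM2 | 1.555 TB/s       | 19.5 TFLOPS              | 312 TFLOPS       | 624 TFLOPS                   | N/A / 1248 TOPS                     |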

To learn more about how the NVIDIA H100 GPU compares to other machine types, read our machine types and their performance specs.

Note
NVIDIA H100x8 machines have 3.2 Tb/s speeds with NVLink interconnects between GPUs. This works without any additional setup when using multiple machines on a private network. There are no interconnects and no NVLink when you are using NVIDIA H100x1.

Common Commands

After you have created and connected to your NVIDIA H100 GPU, you can use these commands to verify the state of your GPUs, such as whether they are accessible within the environment or whether NVLink support is active.

  • nvidia-smi: Checks if the GPUs are present. Use the PyTorch command python -m torch.utils.collect_env to get additional information about the environment.
  • nvidia-smi topo -m: Checks if NVLink is available. If it is available, it outputs NV18 between all GPUs on NVIDIA H100.
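
The two checks above can be wrapped in small helpers. This is a minimal sketch, not an official Paperspace script: the function names count_gpus and nvlink_link_types are hypothetical, and it assumes that nvidia-smi -L prints one line per GPU starting with "GPU <index>:" and that NVLink connections appear in the topology matrix as NV followed by a number (NV18 on NVIDIA H100x8, as described above).

```shell
# Count GPUs from the output of `nvidia-smi -L` (read from stdin).
# Assumes one line per GPU, each starting with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# List the distinct NVLink link types (for example NV18) found in the
# output of `nvidia-smi topo -m` (read from stdin). Prints nothing when
# no NVLink connection is reported.
nvlink_link_types() {
    grep -o 'NV[0-9][0-9]*' | sort -u
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi -L | count_gpus
#   nvidia-smi topo -m | nvlink_link_types
```

On an NVIDIA H100x1 machine the second helper should print nothing, because there is no NVLink to report.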

Best Practices

General Usage

  • After moving 10 GB or more of data or models, we recommend using checksums (such as md5sum) to verify that the move was successful. Instead of moving files with the mv command, we also recommend copying them to the new location with cp, running a checksum on the copies, and only then deleting the originals with rm. This keeps an intact copy of the data in case the copy process corrupts it.
  • You may need to bind-mount shared drives and enable ports when using containers. Port publishing works better with --network=host than --publish.
  • If you need to keep Docker images small, such as when using NVIDIA H100 to fine-tune large language models, you can free up space by deleting cached models, such as those in ~/.cache/huggingface.
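
The copy, checksum, and delete sequence recommended above can be sketched as a single helper. The function name verified_move is hypothetical, and the sketch assumes that the destination is a full file path (not a directory) and that md5sum is installed.

```shell
# Copy a file, compare MD5 checksums, and delete the source only when
# the checksums match. Returns non-zero and keeps the source otherwise.
verified_move() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" || return 1
    src_sum=$(md5sum < "$src" | cut -d ' ' -f 1)
    dst_sum=$(md5sum < "$dst" | cut -d ' ' -f 1)
    if [ "$src_sum" = "$dst_sum" ]; then
        rm -- "$src"
    else
        echo "checksum mismatch: keeping $src" >&2
        return 1
    fi
}

# Usage (both paths are placeholders):
#   verified_move /path/to/model.bin /path/to/new/model.bin
```

Because the source is deleted only after the checksums match, a failed or corrupted copy leaves the original file untouched.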

Model Training

  • We recommend running training inside tmux, a tool that lets you manage multiple terminal sessions at once. Because training runs can take a long time, tmux is useful: it does not terminate a run if a network error breaks your connection to the terminal, such as when you receive a client_loop: send disconnect: Broken pipe error.

Environments

  • On macOS, you can access tmux with iTerm2, a terminal replacement with tmux support. Once set up, run the command tmux -CC. After starting your run, close your tmux session using your Esc key from the original window to detach it. Later, you can reopen it with tmux -CC attach. When the process is done, close the window.
  • The full disk size is not the same as the amount of storage available, which is the disk size minus the overhead of the virtual machine template. For example, with ML-in-a-Box, the template uses 28 GB of space, which leaves 66 GB of available storage. To see your disk size and available storage, run the command df -h. This returns output like the following:
Filesystem                   Size  Used Avail Use% Mounted on
tmpfs                         25G  1.5M   25G   1% /run
/dev/mapper/ubuntu--vg-root   97G   28G   65G  31% /
tmpfs                        123G     0  123G   0% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
/dev/xvda2                   1.7G  232M  1.4G  15% /boot
tmpfs                         25G   92K   25G   1% /run/user/1000

The /dev/mapper/ubuntu--vg-root line shows you how much disk space your machine has available.
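
To read the available space without scanning the whole table, you can filter the df -h output for the filesystem mounted on /. This is a small sketch; the helper name root_available is hypothetical, and it assumes the column layout shown above, where Avail is the fourth column and the mount point is the last.

```shell
# Print the "Avail" column for the filesystem mounted on "/" from
# `df -h` output read on stdin.
root_available() {
    awk '$NF == "/" { print $4 }'
}

# Usage:
#   df -h | root_available
```

For the sample output above, this prints 65G.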

  • If you are using a virtual environment such as Python venv, start tmux first and then activate the environment.
  • The default Python is 3.11.x, accessed via python or python3.
Note

Although the CUDA version reported by nvidia-smi (12.2) and the CUDA tools version reported by nvcc (12.1) differ, both are correct and do not conflict when working together.

nvidia-smi
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
...

nvcc --version
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
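
The two numbers measure different things: nvidia-smi reports the highest CUDA version that the installed driver supports, while nvcc reports the version of the installed CUDA toolkit. A toolkit version at or below the driver's version is the expected arrangement. The following sketch extracts both versions and compares them; the function names are hypothetical, and the parsing assumes the output formats shown above.

```shell
# Extract the CUDA version from the nvidia-smi banner (stdin).
driver_cuda_version() {
    sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Extract the toolkit release from `nvcc --version` output (stdin).
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when version $1 (major.minor) is less than or equal to $2.
version_le() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit (x[1] + 0 > y[1] + 0)
        exit (x[2] + 0 > y[2] + 0)
    }'
}

# Usage:
#   driver=$(nvidia-smi | driver_cuda_version)
#   toolkit=$(nvcc --version | toolkit_cuda_version)
#   version_le "$toolkit" "$driver" && echo "toolkit $toolkit is within driver limit $driver"
```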

Multinode

  • For machines created after 17 January 2024, the template includes an NCCL configuration file, /etc/nccl.conf, to enable optimal running on the new H100 GPU fabric. The file contains the following lines:
NCCL_TOPO_FILE=/etc/nccl/topo.xml
NCCL_IB_DISABLE=0
NCCL_IB_CUDA_SUPPORT=1
NCCL_IB_HCA=mlx5
NCCL_CROSS_NIC=0
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_GID_INDEX=1

For machines created before 17 January 2024, users need to run the following command to create the /etc/nccl.conf file:

sudo bash -c 'cat <<EOF > /etc/nccl.conf
NCCL_TOPO_FILE=/etc/nccl/topo.xml
NCCL_IB_DISABLE=0
NCCL_IB_CUDA_SUPPORT=1
NCCL_IB_HCA=mlx5
NCCL_CROSS_NIC=0
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_GID_INDEX=1
EOF'

This enables the same optimal running for multinode that machines created after 17 January 2024 have by default.
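
To confirm that a machine has the expected configuration, you can check /etc/nccl.conf for each of the settings listed above. This is a minimal sketch; the function name missing_nccl_settings is hypothetical, and it only checks that each line is present exactly as written above.

```shell
# Print every expected NCCL setting that is absent from the given file
# (default: /etc/nccl.conf). Prints nothing when all settings are present.
missing_nccl_settings() {
    conf=${1:-/etc/nccl.conf}
    for setting in \
        NCCL_TOPO_FILE=/etc/nccl/topo.xml \
        NCCL_IB_DISABLE=0 \
        NCCL_IB_CUDA_SUPPORT=1 \
        NCCL_IB_HCA=mlx5 \
        NCCL_CROSS_NIC=0 \
        NCCL_SOCKET_IFNAME=eth0 \
        NCCL_IB_GID_INDEX=1
    do
        grep -qxF -- "$setting" "$conf" 2>/dev/null || echo "$setting"
    done
}

# Usage:
#   missing_nccl_settings            # checks /etc/nccl.conf
#   missing_nccl_settings ./my.conf  # checks another file
```

If the file does not exist, all seven settings are reported as missing.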

  • When using containers for multinode, pass the InfiniBand devices and volumes to the container. Adding the following arguments to a docker run command gives the best multinode performance on your H100s. The devices and volumes expose the InfiniBand protocols used in the upgraded infrastructure.
--device /dev/infiniband/:/dev/infiniband/
--volume /dev/infiniband/:/dev/infiniband/
--volume /sys/class/infiniband/:/sys/class/infiniband/
--volume /etc/nccl/:/etc/nccl/
--volume /etc/nccl.conf:/etc/nccl.conf:ro
Note
To access NVIDIA’s GPU-optimized models, SDKs, and containers for optimal performance, visit the NGC Catalog.
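
The container arguments listed above can be kept in one helper so that every docker run invocation uses the same set. This is a sketch only: the function name multinode_docker_args is hypothetical, IMAGE is a placeholder for your own container image, and the --gpus all flag in the usage line is an assumption that requires the NVIDIA Container Toolkit on the host.

```shell
# Print the InfiniBand and NCCL container arguments, one per line.
multinode_docker_args() {
    cat <<'EOF'
--device /dev/infiniband/:/dev/infiniband/
--volume /dev/infiniband/:/dev/infiniband/
--volume /sys/class/infiniband/:/sys/class/infiniband/
--volume /etc/nccl/:/etc/nccl/
--volume /etc/nccl.conf:/etc/nccl.conf:ro
EOF
}

# Usage (the substitution is left unquoted on purpose so that the shell
# splits it into separate arguments):
#   docker run --gpus all --network=host $(multinode_docker_args) IMAGE
```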

Limitations

  • The following features are not currently available:

    • NUMA mapping is not implemented. CPU affinities are shown as N/A in nvidia-smi topo -m.
    • Shared drives do not support symbolic links. A shared drive that is created in the console’s Drives tab, visible from a virtual machine, and bind-mounted to a container does not support symlinks:
    >>> ln -s test /mnt/my_shared_drive/test
    ln: failed to create symbolic link '/mnt/my_shared_drive/test': Operation not supported
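
If a script depends on symbolic links, it can test a directory before relying on them. The following is a minimal sketch; the function name supports_symlinks is hypothetical, and it assumes you have write access to the directory being tested.

```shell
# Succeed when the given directory accepts symbolic links.
# Creates and then removes a temporary link named .symlink_probe.<pid>.
supports_symlinks() {
    probe="$1/.symlink_probe.$$"
    if ln -s probe_target "$probe" 2>/dev/null; then
        rm -f -- "$probe"
        return 0
    fi
    return 1
}

# Usage (the mount point is the example path from the error above):
#   supports_symlinks /mnt/my_shared_drive || echo "symlinks not supported here"
```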
    

  1. The NVIDIA H100 Tensor Core specifications for the TF32, BFLOAT16, FP16, FP8, and INT8 data types are with sparsity, meaning the data contains matrices with mostly zeros. For more information on NVIDIA H100 Tensor Cores, visit NVIDIA’s Tensor Core GPU data sheet.