Machines are Linux and Windows virtual machines with persistent storage, GPU options, and free unlimited bandwidth. They’re designed for high-performance computing (HPC) workloads.
We recommend the following best practices to optimize your machine’s performance and prevent technical issues. These best practices emphasize model and data safety as well as training optimizations.
A snapshot is a disk image of your machine at a specific point in time. Snapshots are useful as backups of your machine because you can use them to revert your machine to a previous state.
You should use snapshots as machine backups if you’re:
Because a snapshot captures your machine’s entire disk at a specific point in time, it acts as a reliable backup. Snapshots protect the work you’ve done on the machine from unintended consequences, like data corruption, system failures, or accidental disconnections from Paperspace, and they let you restore your machine with minimal downtime.
You can set up snapshots by either taking a manual snapshot or setting up auto-snapshots.
We recommend taking snapshots of your machine at regular intervals based on how frequently your model or data changes and how often you modify your machine’s configuration.
tmux is a terminal multiplexer that lets you create and use multiple terminal sessions simultaneously.
You should use tmux for model training if you’re:
tmux improves model training by running your sessions in a background server process that does not depend on any single terminal window, so your training sessions continue running even if you are inactive on your machine or a terminal disconnects. If one of your terminals has a network error, the session keeps running and you can reattach to it from a different terminal.
Additionally, tmux allows you to close a terminal session and start a new one without disrupting running processes in other sessions. With multiple terminals, you can train several models or run multiple tests on your models simultaneously, thereby shortening your training and testing processes.
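As a rough illustration of this behavior, the following sketch starts a training command in a detached tmux session so that it keeps running after the terminal closes. The helper start_session, the session name train, and the command python train.py are hypothetical placeholders, not part of tmux or of any Paperspace tooling.

```shell
#!/bin/sh
# Sketch only: "start_session" is a hypothetical helper, not a tmux built-in.
# $1 = session name, $2 = command to run inside the session.
start_session() {
  # "tmux has-session" exits non-zero when no session has that name.
  if tmux has-session -t "$1" 2>/dev/null; then
    echo "session $1 is already running"
  else
    # -d starts the session detached, so it does not depend on this terminal.
    tmux new-session -d -s "$1" "$2"
    echo "started session $1"
  fi
}

# Example call (uncomment to use; the names are placeholders):
# start_session train "python train.py"
```

After starting a session this way, you can reattach to it with tmux attach -t train, detach again with Ctrl-b d (the default key binding), and list running sessions with tmux ls.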
You can set up tmux on your machine by installing tmux and configuring its .tmux.conf file.
If you’re using a virtual environment, such as Python venv, you need to start tmux before activating the environment so tmux is aware of the environments you’re using.
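The ordering can be sketched as follows, assuming a venv whose activation script lives at bin/activate inside the environment directory. The helper start_in_venv and all names passed to it are hypothetical. The sketch starts the tmux session first and only then activates the environment inside it.

```shell
#!/bin/sh
# Sketch only: "start_in_venv" is a hypothetical helper.
# $1 = session name, $2 = path to the venv directory, $3 = training command.
start_in_venv() {
  # 1. Start tmux first, as a detached session running a plain shell.
  tmux new-session -d -s "$1"
  # 2. Activate the environment inside that session.
  tmux send-keys -t "$1" ". $2/bin/activate" Enter
  # 3. Launch the training command in the activated environment.
  tmux send-keys -t "$1" "$3" Enter
}

# Example call (uncomment to use; the names are placeholders):
# start_in_venv train "$HOME/venvs/train" "python train.py"
```

Typing the same three steps by hand works equally well: run tmux, activate the environment at the prompt inside the session, and then start training.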
Multi-node computing involves using multiple nodes for high-performance computing tasks.
You should use multi-node computing if you’re:
Multi-node computing improves scalability by distributing computation and data across multiple nodes, so capacity grows as you add more nodes. It also offers better resource utilization, as each node handles a different part of the computation or HPC task.
Using multiple nodes can help your training session continue even if one node fails, as the session can transfer to another active node. By splitting large datasets and training tasks across nodes, multi-node computing reduces training and testing time.
Moreover, running multiple training sessions simultaneously helps identify the ideal hyperparameters for your model more quickly because you can run and test different hyperparameter combinations of your model on different nodes.
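One simple way to spread hyperparameter combinations across nodes is round-robin assignment, sketched below. The environment variables NODE_RANK and NUM_NODES, the helper combos_for_node, and the learning-rate and batch-size values are all assumptions made for illustration. A real setup would take the rank and node count from its own launcher.

```shell
#!/bin/sh
# Sketch only: NODE_RANK and NUM_NODES are hypothetical variables that a
# launcher would set on each node; the grid values are placeholders.
NUM_NODES="${NUM_NODES:-2}"
NODE_RANK="${NODE_RANK:-0}"

# Print the combinations assigned to one node.
# $1 = node rank (starting at 0), $2 = total number of nodes.
combos_for_node() {
  i=0
  for lr in 0.01 0.001; do
    for bs in 32 64 128; do
      # Combination number i goes to node (i mod number of nodes).
      if [ $((i % $2)) -eq "$1" ]; then
        echo "lr=$lr batch_size=$bs"
      fi
      i=$((i + 1))
    done
  done
}

combos_for_node "$NODE_RANK" "$NUM_NODES"
```

Each node runs the same script with its own rank, so no combination is trained twice and the grid is covered once all nodes finish.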
When using containers for multi-node setups, we recommend adding InfiniBand devices and volumes to the container runtime configuration file because they help achieve optimal performance in a multi-node environment. The following is an example configuration:
--device /dev/infiniband/:/dev/infiniband/
--volume /dev/infiniband/:/dev/infiniband/
--volume /sys/class/infiniband/:/sys/class/infiniband/
--volume /etc/nccl/:/etc/nccl/
--volume /etc/nccl.conf:/etc/nccl.conf:ro
--device allows the environment to access specific hardware, such as InfiniBand devices, which improves the efficiency of data transfers and resource sharing.
--volume mounts host directories in the multi-node environment so it has access to the necessary files and directories that may improve its performance.
For example, --volume /etc/nccl/:/etc/nccl/ mounts the NVIDIA Collective Communications Library (NCCL) configuration directory to the multi-node environment. NCCL optimizes the multi-node environment’s performance by enabling efficient multi-node data transfers.
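If the container runtime is Docker, the same options can also be passed on the command line. The sketch below only assembles and prints such a command so that you can review it before running anything. The image name, the training command, the helper ib_run_command, and the --rm and --gpus all options are assumptions that do not come from the configuration above, and --gpus all additionally assumes the NVIDIA Container Toolkit is installed.

```shell
#!/bin/sh
# Sketch only: IMAGE and the training command are hypothetical placeholders.
IMAGE="my-training-image:latest"

# Print a docker run command that includes the InfiniBand and NCCL options.
ib_run_command() {
  echo docker run --rm --gpus all \
    --device /dev/infiniband/:/dev/infiniband/ \
    --volume /dev/infiniband/:/dev/infiniband/ \
    --volume /sys/class/infiniband/:/sys/class/infiniband/ \
    --volume /etc/nccl/:/etc/nccl/ \
    --volume /etc/nccl.conf:/etc/nccl.conf:ro \
    "$IMAGE" python train.py
}

# Print the command for review instead of executing it.
ib_run_command
```

Printing the command first makes it easy to confirm that the paths exist on the host before starting the container.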