Machines Best Practices

Validated on 7 Aug 2024 • Last edited on 25 Oct 2024

Machines are Linux and Windows virtual machines with persistent storage, GPU options, and free unlimited bandwidth. They’re designed for high-performance computing (HPC) workloads.

We recommend the following best practices to optimize your machine’s performance and prevent technical issues. These best practices emphasize model and data safety as well as training optimizations.

Use Snapshots as Machine Backups

A snapshot is a disk image of your machine at a specific point in time. Snapshots are useful as backups of your machine because you can use them to revert your machine to a previous state.

When Should I Do This?

You should use snapshots as machine backups if you’re:

Testing new configurations.
Installing new software.
Updating your machine.
Saving a machine version after completing a milestone or important task.

Why Should I Do This?

Because a snapshot captures your machine’s disk at a specific point in time, it acts as a reliable backup. Snapshots protect the work you’ve done on the machine from unintended consequences, like data corruption, system failures, or accidental disconnections from Paperspace, and they let you restore your machine with minimal downtime.

How Do I Implement This?

You can set up snapshots by either taking a manual snapshot or setting up auto-snapshots.

We recommend taking snapshots of your machine at regular intervals based on how frequently your model or data changes and how often you modify your machine’s configuration.

Use Terminal Multiplexer (tmux) For Model Training

tmux is a terminal multiplexer which lets you create and use multiple terminal sessions simultaneously.

When Should I Do This?

You should use tmux for model training if you’re:

Running long model training sessions.
Managing multiple tasks or models concurrently.
Monitoring training progress across multiple sessions.

How Does This Improve Performance?

tmux keeps your sessions running in a background process on the machine, so your training continues even if you are inactive on your machine or a terminal disconnects. If one of your terminals has a network error, the tmux session keeps running, and you can reattach to it from a new terminal.

Additionally, tmux allows you to close a terminal session and start a new one without disrupting running processes in other sessions. With multiple sessions, you can train several models or run multiple tests on your models simultaneously, which shortens your overall training and testing time.

How Do I Implement This?

You can set up tmux on your machine by installing tmux and configuring its .tmux.conf file.

If you’re using a virtual environment, such as Python venv, start tmux first and then activate the environment inside the tmux session, so the environment is active in the shell where your training runs.
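As a rough sketch of this workflow, the following shell functions start a detached tmux session and then launch training inside it. The session name `train`, the venv path `$HOME/venv`, and the script `train.py` are placeholders chosen for illustration, not values this guide prescribes. The tmux subcommands used (`new-session`, `send-keys`, `attach`) are standard tmux.

```shell
# Placeholder values (assumptions; adjust for your machine).
SESSION="${SESSION:-train}"
VENV_DIR="${VENV_DIR:-$HOME/venv}"
TRAIN_CMD="${TRAIN_CMD:-python train.py}"

# Build the command line that gets typed into the tmux session:
# activate the venv first, then start training.
# `.` is the portable spelling of `source`.
build_inner_cmd() {
  printf '. %s/bin/activate && %s' "$VENV_DIR" "$TRAIN_CMD"
}

# Start tmux first (detached), then activate the venv inside it.
start_training_session() {
  # -d creates the session without attaching, -s names it.
  tmux new-session -d -s "$SESSION" || return 1
  # C-m sends Enter so the command runs immediately.
  tmux send-keys -t "$SESSION" "$(build_inner_cmd)" C-m
}
```

Calling `start_training_session` returns right away while training continues inside tmux. Run `tmux attach -t train` to watch progress, press Ctrl-b and then d to detach, and reattach the same way after a dropped connection.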

Use Multi-node Computing For Model Training

Multi-node computing uses multiple machines (nodes) that work together on high-performance computing tasks.

When Should I Do This?

You should use multi-node computing if you’re:

Training complex models that require extensive computational power and memory.
Working with a large dataset, especially when the dataset doesn’t fit into your machine’s memory.
Running long model training sessions.
Managing multiple tasks or models concurrently.

How Does This Improve Performance?

Multi-node computing improves scalability by distributing computation and data across multiple nodes, so you can add capacity by adding nodes. It also offers better resource utilization, as each node handles a different part of the computation or HPC task.

Using multiple nodes can also help your training session continue if one node fails, because the work can move to another active node. By splitting large datasets and training tasks across nodes, multi-node computing shortens training and testing time.

Moreover, running multiple training sessions simultaneously helps identify the ideal hyperparameters for your model more quickly because you can run and test different hyperparameter combinations of your model on different nodes.
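One simple way to spread a hyperparameter search across nodes is to let each node pick its own value from a shared list based on its rank. The sketch below assumes a hypothetical list of learning rates and a 0-based node rank; neither comes from this guide.

```shell
# Hypothetical search space: one learning rate per node.
LEARNING_RATES="0.1 0.01 0.001 0.0001"

# Print the learning rate assigned to the node with rank $1 (0-based).
# Ranks beyond the end of the list wrap around.
pick_lr() {
  rank="$1"
  set -- $LEARNING_RATES        # split the list into positional parameters
  target=$(( rank % $# ))
  i=0
  for lr in "$@"; do
    if [ "$i" -eq "$target" ]; then
      printf '%s\n' "$lr"
      return 0
    fi
    i=$(( i + 1 ))
  done
}
```

On each node you might then run something like `python train.py --lr "$(pick_lr "$NODE_RANK")"`, where `NODE_RANK`, `train.py`, and the `--lr` flag are placeholders for however your own setup identifies nodes and accepts hyperparameters.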

How Do I Implement This?

When using containers for multi-node setups, we recommend adding InfiniBand devices and volumes to the container runtime configuration file because they help achieve optimal performance in a multi-node environment. The following is an example configuration:

--device /dev/infiniband/:/dev/infiniband/
--volume /dev/infiniband/:/dev/infiniband/
--volume /sys/class/infiniband/:/sys/class/infiniband/
--volume /etc/nccl/:/etc/nccl/
--volume /etc/nccl.conf:/etc/nccl.conf:ro

--device gives the container access to specific host hardware, such as InfiniBand devices, which improves the efficiency of data transfers and resource sharing between nodes.
--volume mounts host directories in the container so it has access to the files and directories it needs to perform well. For example, --volume /etc/nccl/:/etc/nccl/ mounts the NVIDIA Collective Communications Library (NCCL) configuration directory in the container. NCCL provides efficient multi-GPU and multi-node communication, which speeds up data transfers between nodes.
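If your container runtime is Docker, the same options can be passed directly to `docker run`. The wrapper below is a minimal sketch under that assumption: the function name, the `DOCKER` override, and the `--rm` cleanup flag are additions for illustration, and the image and command are whatever you supply. GPU and networking flags are left out because they depend on your setup.

```shell
# Override DOCKER (for example, DOCKER=echo) to preview the command
# instead of running it. This switch is a convenience of this sketch.
DOCKER="${DOCKER:-docker}"

# Usage: run_training_container IMAGE [COMMAND...]
run_training_container() {
  image="$1"
  shift
  # Expose the InfiniBand devices and mount the InfiniBand and NCCL
  # paths from the host, using the example configuration shown earlier.
  "$DOCKER" run --rm \
    --device /dev/infiniband/:/dev/infiniband/ \
    --volume /dev/infiniband/:/dev/infiniband/ \
    --volume /sys/class/infiniband/:/sys/class/infiniband/ \
    --volume /etc/nccl/:/etc/nccl/ \
    --volume /etc/nccl.conf:/etc/nccl.conf:ro \
    "$image" "$@"
}
```

For example, `DOCKER=echo run_training_container my-image:latest python train.py` prints the full command so you can check the mounts before starting a real container; `my-image:latest` and `train.py` are placeholder names.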
