Multi-GPU Training

Multi-GPU training distributes the work of model training across multiple GPUs. This accelerates training and helps avoid out-of-memory (OOM) errors.

To use this feature, set num_gpus in your config file.

How to choose batch_size

Multi-GPU training splits each batch across the GPUs so that the effective batch size is the same as if you ran on a single GPU. Thus, if you previously reduced the batch size to make your model fit on one GPU, you should increase it back to your desired effective batch size.

The batch size configuration parameters that this applies to are training.train_batch_size and training.val_batch_size for the labeled frames, and dali.base.train.sequence_length and dali.context.train.batch_size for unlabeled video frames. Test batch sizes are not covered here because testing runs on only one GPU.

Calculation of per-GPU batch size

Given the above, you need not worry about how lightning-pose calculates per-GPU batch size, but it is documented here for transparency.

In general the per-GPU batch size will be:

ceil(batch_size / num_gpus)

The exception to this is the unlabeled per-GPU batch size for context models (heatmap_mhcrnn):

ceil((batch_size - 4) / num_gpus) + 4

The adjusted calculation for the unlabeled batch size for context models maintains the same single-GPU effective batch size by accounting for the 4 context frames that are loaded with each training frame. For example, if you specified dali.context.train.batch_size=16, then your effective batch size was 16 - 4 = 12. To maintain 12 with 2 GPUs, each GPU will load 6 frames + 4 context frames, for a per-GPU batch size of 10. This is larger than simply dividing the original batch size of 16 across 2 GPUs.
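
The two formulas above can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic, not the lightning-pose implementation; the function name per_gpu_batch_size and its context_frames parameter are made up for this example.

```python
import math


def per_gpu_batch_size(batch_size: int, num_gpus: int, context_frames: int = 0) -> int:
    """Per-GPU batch size following the formulas in this section.

    context_frames=0 gives the general case, ceil(batch_size / num_gpus).
    context_frames=4 gives the unlabeled batch size for context models,
    ceil((batch_size - 4) / num_gpus) + 4.
    (Hypothetical helper for illustration only.)
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if batch_size <= context_frames:
        raise ValueError("batch_size must be larger than context_frames")
    return math.ceil((batch_size - context_frames) / num_gpus) + context_frames


# General case: a batch of 16 split across 2 GPUs.
print(per_gpu_batch_size(16, 2))  # 8

# Context model, unlabeled frames: 6 frames + 4 context frames per GPU.
print(per_gpu_batch_size(16, 2, context_frames=4))  # 10
```

With num_gpus=1 both formulas return the configured batch size unchanged, which matches the statement that the effective batch size is the same as on a single GPU.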

Execution model

Warning

The implementation spawns num_gpus - 1 additional processes, each of which re-runs the originally executed command from the beginning. Everything the command does is therefore repeated once per process. For this reason it is advised to run multi-GPU training only through the dedicated training command (litpose train). If you use lightning-pose as part of a custom script and don’t want your entire script to run once per GPU, your script should run litpose train rather than directly calling the train method.
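
One way a custom script might follow this advice is to launch litpose train as a child process with Python's standard subprocess module, so that only the training command is repeated per GPU. The sketch below assumes litpose is on your PATH and that your config file is named config.yaml; the helper names build_train_command and run_training are hypothetical.

```python
import subprocess


def build_train_command(config_path: str) -> list[str]:
    """Return the argument list for the dedicated training command."""
    return ["litpose", "train", config_path]


def run_training(config_path: str) -> None:
    """Run training in a child process.

    The extra per-GPU processes then re-run only `litpose train`,
    not the script that calls this function.
    """
    subprocess.run(build_train_command(config_path), check=True)


# Hypothetical usage inside a larger pipeline script:
#     run_training("config.yaml")
```

Passing check=True makes subprocess.run raise CalledProcessError if training exits with a non-zero status, so the surrounding script does not silently continue after a failed run.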

Tensorboard metric calculation

All metrics can be interpreted the same way as with a single GPU. Each reported metric is the average of its values across the GPUs.

Specifying the GPUs to run on

Use the environment variable CUDA_VISIBLE_DEVICES if you want lightning-pose to run on specific GPUs. For example, to train on only the first two GPUs on your machine:

CUDA_VISIBLE_DEVICES=0,1 litpose train config.yaml
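
If you launch training from a Python script instead of a shell, the same variable can be set on the child process environment. The following is a minimal sketch using only the standard library; the helper name env_with_visible_gpus is hypothetical, and the final launch line is left as a comment because it assumes litpose is installed and config.yaml exists.

```python
import os


def env_with_visible_gpus(gpu_ids: list[int]) -> dict[str, str]:
    """Copy the current environment and restrict CUDA to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


env = env_with_visible_gpus([0, 1])
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1

# To launch training with this environment (assumes litpose is on PATH):
#     import subprocess
#     subprocess.run(["litpose", "train", "config.yaml"], env=env, check=True)
```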