.. _multi_gpu_training:

###################
Multi-GPU Training
###################

Multi-GPU training allows you to distribute the load of model training across GPUs.
This helps overcome OOMs in addition to accelerating training.

To use this feature, set :ref:`num_gpus <config_num_gpus>` in your config file.

How to choose batch_size
========================

Multi-GPU training distributes batches across multiple GPUs in a way that maintains the same
effective batch size as if you ran on 1 GPU. **Thus, if you reduced batch size in order to make
your model fit in one GPU, you should increase it back to your desired effective batch size.**

The batch size configuration parameters that this applies to are ``training.train_batch_size`` and
``training.val_batch_size`` for the labeled frames, and ``dali.train.base.sequence_length`` and
``dali.train.context.batch_size`` for unlabeled video frames. Test batch sizes are not relevant
to this document as testing only occurs on one GPU.

Calculate of per-GPU batch size
-------------------------------

Given the above, you need not worry about how lightning-pose calculates per-GPU batch size,
but it is documented here for transparency.

In general the per-GPU batch size will be:

.. code-block:: python

    ceil(batch_size / num_gpus)

The exception to this is the unlabeled per-GPU batch size for context models (``heatmap_mhcrnn``):

.. code-block:: python

    ceil((batch_size - 4) / num_gpus) + 4

The adjusted calculation for the unlabeled batch size for context models maintains the same
single-GPU effective batch size by accounting for the 4 context frames that are loaded with each
training frame.
For example, if you specified `dali.context.train.batch_size=16`, then your effective batch size
was 16 - 4 = 12.
To maintain 12 with 2 GPUs, each GPU will load 6 frames + 4 context frames, for a per-GPU batch
size of 10.
This is larger than simply dividing the original batch size of 16 across 2 GPUs.

.. _execution_model:

Execution model
===============

.. warning::
    The implementation spawns ``num_gpus - 1`` processes of the same command originally executed,
    repeating all of the command's execution per process.
    Thus it is advised to only run multi-GPU training in a dedicated training script
    (``litpose train``). If you use lightning-pose as part of a custom script and don't
    want your entire script to run once per GPU, your script should run ``litpose train``
    rather than directly calling the ``train`` method.

Tensorboard metric calculation
==============================

All metrics can be interpreted the same way as with a single-GPU.
The metrics are the average value across the GPUs. 

Specifying the GPUs to run on
=============================

Use the environment variable ``CUDA_VISIBLE_DEVICES`` if you want lightning pose to run on certain
GPUs. For example, if you want to train on only the first two GPUs on your machine,

.. code-block:: bash

    CUDA_VISIBLE_DEVICES=0,1 litpose train config.yaml