.. _training:

########
Training
########

Lightning Pose provides several tools for training models:

#. A set of high-level functions used for creating data loaders, models, trainers, etc.
   You can combine these to create your own custom training script.
   This is required if you used the :ref:`pip package ` installation method.
#. An example training script provided in the :ref:`conda from source ` installation method.
   This demonstrates how to combine the high-level functions for model training and evaluation.

.. note::

    The steps below assume the :ref:`conda from source ` installation method.
    If you did not use this installation method, see the `example training script `_.

Train with example data
=======================

To train a model on the example dataset provided with the Lightning Pose package,
run the following command from inside the ``lightning-pose`` directory
(make sure you have activated your conda environment):

.. code-block:: console

    python scripts/train_hydra.py

Note that there are no arguments; this tells the script to default to the example data.

Train with your data
====================

To train a model on your own dataset, follow these steps:

#. Ensure your data is in the :ref:`proper data format `.
#. Copy the file ``scripts/configs/config_default.yaml`` to another directory and rename it.
   You will then need to update the various fields to match your dataset
   (see :ref:`The configuration file ` section).
   See other config files in ``scripts/configs/`` for examples.
#. Train your model from the terminal and override the config path and config name with your newly created file:

.. code-block:: console

    python scripts/train_hydra.py --config-path= --config-name=

You can find more information on the structure of the output model directory :ref:`below `.

Working with ``hydra``
======================

All of the scripts in the ``scripts`` directory rely on the ``hydra`` package to manage arguments in config files.
You have two options: directly edit the config file, or override it from the command line.

#. **Edit** the config file, and save it. Then run the script without arguments:

   .. code-block:: console

       python scripts/train_hydra.py

#. **Override** the argument from the command line; for example, if you want to use a maximum of 11 epochs instead of the default number (not recommended):

   .. code-block:: console

       python scripts/train_hydra.py training.max_epochs=11

   Or, for your own dataset,

   .. code-block:: console

       python scripts/train_hydra.py --config-path= --config-name= training.max_epochs=11

We also recommend first training with images resized to smaller dimensions;
this allows for larger batch sizes and fewer out-of-memory errors on the GPU:

.. code-block:: console

    python scripts/train_hydra.py --config-path= --config-name= data.image_resize_dims.height=256 data.image_resize_dims.width=256

See more documentation on the config file fields :ref:`here `.

A couple of fields are specific to the provided training script, but important to consider:

* ``eval.predict_vids_after_training``: if ``true``, automatically run inference after training on all videos located in the directory given by ``eval.test_videos_directory``; results are saved to the model directory
* ``eval.save_vids_after_training``: if ``true`` (and ``eval.predict_vids_after_training`` is also ``true``), the keypoints predicted during the inference step will be overlaid on the videos and saved with the inference outputs to the model directory

Tensorboard
===========

The outputs of the training script, namely the model checkpoints and tensorboard logs,
will be saved in the ``lightning-pose/outputs/YYYY-MM-DD/HH-MM-SS/tb_logs`` directory by default.
(Note: this behavior can be changed by updating ``hydra.run.dir`` in the config file to an absolute path of your choosing.)

To view the logged losses with tensorboard in your browser, in the command line, run:

..
.. code-block:: console

    tensorboard --logdir outputs/YYYY-MM-DD/

where you use the date on which you ran the model.
Click on the provided link in the terminal, which will look something like ``http://localhost:6006/``.
Note that if you save the model in a different directory, just use that directory after ``--logdir``.

.. note::

    If you don't see all your models in tensorboard,
    click the refresh button in the top right corner of the screen,
    and the other models should appear.

Metrics are plotted as a function of step/batch.
Validation metrics are typically recorded less frequently than train metrics.
The frequency of these checks is controlled by ``cfg.training.log_every_n_steps`` (training)
and ``cfg.training.check_val_every_n_epoch`` (validation).

**Available metrics**

The following are the important metrics for all model types (supervised, context, semi-supervised, etc.):

* ``train_supervised_loss``: this is the same as ``train_heatmap_mse_loss_weighted``, which is the mean square error (MSE) between the true and predicted heatmaps on labeled training data
* ``train_supervised_rmse``: the root mean square error (RMSE) between the true and predicted (x, y) coordinates on labeled training data; scale is in pixels
* ``val_supervised_loss``: this is the same as ``val_heatmap_mse_loss_weighted``, which is the MSE between the true and predicted heatmaps on labeled validation data
* ``val_supervised_rmse``: the RMSE between the true and predicted (x, y) coordinates on labeled validation data; scale is in pixels

The following are important metrics for the semi-supervised models:

* ``train_pca_multiview_loss_weighted``: the ``train_pca_multiview_loss`` (in pixels), which measures multiview consistency, multiplied by the loss weight set in the configuration file. This metric is only computed on batches of unlabeled training data.
* ``train_pca_singleview_loss_weighted``: the ``train_pca_singleview_loss`` (in pixels), which measures pose plausibility, multiplied by the loss weight set in the configuration file. This metric is only computed on batches of unlabeled training data.
* ``train_temporal_loss_weighted``: the ``train_temporal_loss`` (in pixels), which measures temporal smoothness, multiplied by the loss weight set in the configuration file. This metric is only computed on batches of unlabeled training data.
* ``total_unsupervised_importance``: a weight on all *weighted* unsupervised losses that linearly increases from 0 to 1 over 100 epochs
* ``total_loss``: weighted supervised loss (``train_heatmap_mse_loss_weighted``) plus ``total_unsupervised_importance`` times the sum of all applicable weighted unsupervised losses

.. _model_directory_structure:

Model directory structure
=========================

If you train a model using our script ``lightning-pose/scripts/train_hydra.py``,
a directory will be created with the following structure.
The default is to save models in a directory called ``outputs`` inside the Lightning Pose directory;
to change this, update the config fields ``hydra.run.dir`` and ``hydra.sweep.dir`` with absolute paths of your choosing.

.. code-block::

    /path/to/models/YYYY-MM-DD/HH-MM-SS/
    ├── tb_logs/
    ├── video_preds/
    │   └── labeled_videos/
    ├── config.yaml
    ├── predictions.csv
    ├── predictions_pca_multiview_error.csv
    ├── predictions_pca_singleview_error.csv
    └── predictions_pixel_error.csv

* ``tb_logs/``: model weights
* ``video_preds/``: predictions and metrics from videos. The config field ``eval.test_videos_directory`` points to a directory of videos; if ``eval.predict_vids_after_training`` is set to ``true``, all videos in the indicated directory will be run through the model upon training completion and the results stored here.
* ``video_preds/labeled_videos/``: labeled mp4s.
  The config field ``eval.test_videos_directory`` points to a directory of videos; if ``eval.save_vids_after_training`` is set to ``true``, all videos in the indicated directory will be run through the model upon training completion and the results stored here.
* ``predictions.csv``: predictions on labeled data. The right-most column records the train/val/test split that each example belongs to.
* ``predictions_pixel_error.csv``: Euclidean distance between the predictions in ``predictions.csv`` and the labeled keypoints (in ``.csv``) per keypoint and frame.

We also compute all unsupervised losses, where applicable, and store them (per keypoint and frame) in the following csvs:

* ``predictions_pca_multiview_error.csv``: PCA multiview reprojection error between predictions and labeled keypoints
* ``predictions_pca_singleview_error.csv``: PCA singleview reprojection error between predictions and labeled keypoints
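
To make the per-keypoint, per-frame pixel error described above concrete, the following is a minimal sketch of a Euclidean distance computation on toy data. It is not the Lightning Pose implementation: the function name ``pixel_error``, the nested-list layout of shape (frames, keypoints, 2), and the choice to return NaN for unlabeled keypoints are all assumptions made for illustration.

```python
import math


def pixel_error(pred, true):
    """Euclidean distance in pixels for every frame and keypoint.

    Hypothetical helper for illustration only.
    pred, true: nested lists of shape (n_frames, n_keypoints, 2) holding (x, y).
    Returns a nested list of shape (n_frames, n_keypoints).
    """
    errors = []
    for pred_frame, true_frame in zip(pred, true):
        row = []
        for (px, py), (tx, ty) in zip(pred_frame, true_frame):
            if math.isnan(tx) or math.isnan(ty):
                # Assumption: a missing label yields NaN rather than an error value.
                row.append(float("nan"))
            else:
                row.append(math.hypot(px - tx, py - ty))
        errors.append(row)
    return errors


# Toy data: 2 frames, 2 keypoints each; the last keypoint has no label.
nan = float("nan")
pred = [[(3.0, 4.0), (10.0, 10.0)], [(1.0, 1.0), (5.0, 5.0)]]
true = [[(0.0, 0.0), (10.0, 12.0)], [(1.0, 1.0), (nan, nan)]]
errors = pixel_error(pred, true)
```

Each entry of ``errors`` corresponds to one frame (row) and one keypoint (column), which mirrors the per-keypoint, per-frame layout described for the error csv files.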