Model config.yaml
The model config.yaml file is the single configuration file that defines how a model
will be trained. It is generated by the App when you create a model,
and it is the primary input expected by the litpose train command. It points to data
directories, defines the type of models to fit, and specifies a wide range of hyperparameters.
A template file can be found here. When training a model on a new dataset, you must copy/paste this template onto your local machine and update the arguments to match your data.
The config file consists of the following sections:
data: information about where data is stored, keypoint names, etc.
training: batch size, training epochs, image augmentation, etc.
model: backbone architecture, unsupervised losses to use, etc.
dali: batch sizes for unlabeled video data
losses: hyperparameters for unsupervised losses
eval: paths for video inference
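As a rough illustration of this layout, the sketch below represents a config as a nested Python dict and reports which of the six top-level sections are absent. The helper name `missing_sections` and the example values are hypothetical and are not part of Lightning Pose.

```python
# Top-level sections named in the list above.
EXPECTED_SECTIONS = ("data", "training", "model", "dali", "losses", "eval")


def missing_sections(cfg: dict) -> list:
    """Return the expected top-level sections that are absent from cfg."""
    return [name for name in EXPECTED_SECTIONS if name not in cfg]


# Hypothetical, heavily abbreviated config; a real file has many more fields.
example_cfg = {
    "data": {"num_keypoints": 4},
    "training": {"train_batch_size": 16},
    "model": {"model_type": "heatmap"},
    "dali": {},
    "losses": {},
}

# The "eval" section was left out of this example on purpose.
print(missing_sections(example_cfg))
```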
Data parameters
All of these parameters except downsample_factor are dataset-specific and will need to be
provided.
data.image_resize_dims.height/width (int): images (and videos) will be resized to the specified height and width before being processed by the network. Supported values are {64, 128, 256, 384, 512}. The height and width need not be identical. Some points to keep in mind when selecting these values: if the resized images are too small, you will lose resolution/details; if they are too large, the model takes longer to train and might not train as well.
data.data_dir/video_dir (str): update these to reflect your (absolute) local paths
data.csv_file (str): location of labels csv file; this should be relative to data.data_dir
data.downsample_factor (int, default: 2): factor by which to downsample the heatmaps relative to data.image_resize_dims
data.num_keypoints (int): the number of body parts. If using a mirrored setup, this should be the number of body parts summed across all views. If using a multiview setup, this number should indicate the number of keypoints per view (must be the same across all views).
data.keypoint_names (list): keypoint names should reflect the actual names/order in the csv file. This field is necessary if, for example, you are running inference on a machine that does not have the training data saved on it.
data.mirrored_column_matches (list): see the Multiview PCA documentation
data.columns_for_singleview_pca (list): see the Pose PCA documentation
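Two of the rules above are easy to check before training: the resize dimensions must come from the supported set, and data.csv_file is interpreted relative to data.data_dir. The sketch below illustrates both with hypothetical helper names and made-up paths; it is not how Lightning Pose itself validates the config.

```python
from pathlib import PurePosixPath

# Resize values listed as supported in the parameter description above.
SUPPORTED_DIMS = {64, 128, 256, 384, 512}


def check_resize_dims(height: int, width: int) -> None:
    """Raise ValueError if either dimension is not a supported value."""
    for name, value in (("height", height), ("width", width)):
        if value not in SUPPORTED_DIMS:
            raise ValueError(f"{name}={value} is not one of {sorted(SUPPORTED_DIMS)}")


def resolve_csv_path(data_dir: str, csv_file: str) -> str:
    """Join data.csv_file onto data.data_dir, since csv_file is relative to it."""
    return str(PurePosixPath(data_dir) / csv_file)


# Hypothetical values for illustration only.
data_cfg = {
    "image_resize_dims": {"height": 256, "width": 384},
    "data_dir": "/home/user/my_dataset",
    "csv_file": "CollectedData.csv",
}

check_resize_dims(**data_cfg["image_resize_dims"])  # height and width may differ
labels_path = resolve_csv_path(data_cfg["data_dir"], data_cfg["csv_file"])
print(labels_path)
```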
Training parameters
The following parameters relate to model training.
Reasonable defaults are provided, though parameters like the batch sizes
(train_batch_size, val_batch_size, test_batch_size)
may need modification depending on the size of the data and the available compute resources.
See the FAQs for more information on memory management.
training.imgaug (str, default: dlc): select from one of several predefined image/video augmentation pipelines:
default: resizing only
dlc: imgaug pipeline implemented in the DLC 2.0 package
dlc-lr: dlc augmentations plus horizontal flips
dlc-top-down: dlc augmentations plus additional vertical and horizontal flips
You can also define custom augmentation pipelines following these instructions.
training.train_batch_size (int, default: 16): batch size for labeled data during training
training.val_batch_size (int, default: 32): batch size for labeled data during validation
training.test_batch_size (int, default: 32): batch size for labeled data during testing
training.train_prob (float, default: 0.95): fraction of labeled data used for training
training.val_prob (float, default: 0.05): fraction of labeled data used for validation; any remaining frames not assigned to train or validation sets are assigned to the test set
training.train_frames (float or int, default: 1): this parameter determines how many of the frames assigned to the training data (using train_prob) are actually used for training. This option is generally more useful for testing new algorithms than for training production models. If the value is a float between 0 and 1 then it is interpreted as the fraction of total train frames. If the value is an integer greater than 1 then it is interpreted as the number of total train frames.
training.num_gpus (int, default: 1): the number of GPUs for multi-GPU training
training.num_workers (int, default: num_cpus): number of CPU workers for data loaders
training.unfreezing_epoch (int, default: 20): epoch at which backbone network weights begin updating. A value >0 allows the smaller number of parameters in the heatmap head to adjust to the backbone outputs first.
training.min_epochs/training.max_epochs (int, default: 300): length of training. An epoch is one full pass through the dataset. As an example, if you have 400 labeled frames and training.train_batch_size=10, then your dataset is divided into 400/10 = 40 batches. One “batch” in this case is equivalent to one “iteration” for DeepLabCut. Therefore, 300 epochs, at 40 batches per epoch, is equal to 300*40=12k total batches (or iterations).
training.log_every_n_steps (int, default: 10): frequency to log training metrics for tensorboard (one step is one batch)
training.check_val_every_n_epochs (int, default: 5): frequency to log validation metrics for tensorboard
training.ckpt_every_n_epochs (int or null, default: null): save model weights every n epochs; must be divisible by training.check_val_every_n_epochs above. If null, only the best weights will be saved after training, where “best” is defined as the weights from the epoch with the lowest validation loss.
training.early_stopping (bool, default: false): if false, train for the maximum number of epochs and save the best model according to the validation loss; if true, training exits early if the validation loss continues to increase for a given number of validation checks (see training.early_stop_patience below).
training.early_stop_patience (int, default: 3): number of validation checks over which to assess validation metrics for early stopping; this number, multiplied by training.check_val_every_n_epochs, gives the number of epochs over which the validation loss must increase before exiting.
training.rng_seed_data_pt (int, default: 0): rng seed for splitting labeled data into train/val/test
training.rng_seed_model_pt (int, default: 0): rng seed for weight initialization of the head
training.optimizer (str, default: Adam): which optimizer to use (Adam or AdamW)
training.optimizer_params.learning_rate (float, default: 1e-3): optimizer learning rate
training.lr_scheduler (str, default: multisteplr): reduce the learning rate by a certain factor after a given number of epochs (see training.lr_scheduler_params.multisteplr below)
training.lr_scheduler_params.multisteplr: milestones: epochs at which to reduce learning rate; gamma: factor by which to multiply learning rate at each milestone
training.uniform_heatmaps_for_nan_keypoints (bool, default: true): how to treat missing hand labels. Setting this to true will encourage the model to output uniform heatmaps for keypoints that do not have ground truth labels; this will generally lead to low-confidence predictions when a keypoint is occluded. Setting this to false will drop missing keypoints from the loss computation rather than encouraging uniform heatmaps. This generally leads to high-confidence predictions even when a keypoint is occluded. Using false may be preferable if occlusions are brief in time and you want the network to guess where the keypoint should be (rather than signaling uncertainty).
training.accumulate_grad_batches (int, default: 1): (experimental) number of batches to accumulate gradients for before updating weights. Simulates larger batch sizes with memory-constrained GPUs. This parameter is not included in the config by default and should be added manually to the training section.
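The arithmetic behind several of these parameters can be sketched in a few lines of Python. The function names below are hypothetical, and some details are assumptions rather than documented behavior: a partial final batch is counted as a full batch, fractional frame counts are rounded down, and an integer train_frames is capped at the number of available frames.

```python
import math


def total_batches(n_frames: int, batch_size: int, epochs: int) -> int:
    """Total batches (DeepLabCut "iterations") seen over a full training run."""
    batches_per_epoch = math.ceil(n_frames / batch_size)  # assumption: keep partial batch
    return batches_per_epoch * epochs


def resolve_train_frames(train_frames, n_available: int) -> int:
    """Interpret training.train_frames as a fraction or an absolute count."""
    if train_frames == 1:
        return n_available  # default: use every available training frame
    if isinstance(train_frames, float) and 0 < train_frames < 1:
        return int(train_frames * n_available)  # assumption: round down
    if isinstance(train_frames, int) and train_frames > 1:
        return min(train_frames, n_available)  # assumption: cap at what exists
    raise ValueError(f"unsupported train_frames value: {train_frames!r}")


def effective_batch_size(train_batch_size: int, accumulate_grad_batches: int = 1) -> int:
    """Number of labeled frames contributing to each weight update."""
    return train_batch_size * accumulate_grad_batches


# The worked example from the max_epochs description: 400 frames, batch size 10.
print(total_batches(n_frames=400, batch_size=10, epochs=300))
# Half of 380 available training frames.
print(resolve_train_frames(0.5, n_available=380))
# Batch size 16 with gradients accumulated over 4 batches.
print(effective_batch_size(16, accumulate_grad_batches=4))
```

With these inputs the three calls give 12000 batches, 190 frames, and an effective batch size of 64.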
Model parameters
The following parameters relate to model architecture and unsupervised losses.
model.losses_to_use (list, default: []): defines the unsupervised losses. An empty list indicates a fully supervised model. Each element of the list corresponds to an unsupervised loss. For example, model.losses_to_use=[pca_multiview,temporal] will fit both a pca_multiview loss and a temporal loss. Options include:
pca_multiview: penalize inconsistencies between multiple camera views
pca_singleview: penalize implausible body configurations
temporal: penalize large temporal jumps
See the unsupervised losses page for more details on the various losses and their associated hyperparameters.
model.backbone (str, default: resnet50_animal_ap10k): a variety of pretrained backbones are available:
resnet50_animal_ap10k: ResNet-50 pretrained on the AP-10k dataset (Yu et al 2021, AP-10k: A Benchmark for Animal Pose Estimation in the Wild)
resnet18: ResNet-18 pretrained on ImageNet
resnet34: ResNet-34 pretrained on ImageNet
resnet50: ResNet-50 pretrained on ImageNet
resnet101: ResNet-101 pretrained on ImageNet
resnet152: ResNet-152 pretrained on ImageNet
resnet50_contrastive: ResNet-50 pretrained on ImageNet using SimCLR
resnet50_animal_apose: ResNet-50 pretrained on an animal pose dataset (Cao et al 2019, Cross-Domain Adaptation for Animal Pose Estimation)
resnet50_animal_ap10k: ResNet-50 pretrained on AP10k dataset (Yu et al 2021, A Benchmark for Animal Pose Estimation in the Wild)
resnet50_human_jhmdb: ResNet-50 pretrained on JHMDB dataset (Jhuang et al 2013, Towards Understanding Action Recognition)
resnet50_human_res_rle: a regression-based ResNet-50 pretrained on MPII dataset (Andriluka et al 2014, 2D Human Pose Estimation: New Benchmark and State of the Art Analysis)
resnet50_human_top_rle: a heatmap-based ResNet-50 pretrained on MPII dataset (Xiao et al 2018, Simple Baselines for Human Pose Estimation and Tracking)
resnet50_human_hand: ResNet-50 pretrained on OneHand10k dataset (Wang et al 2018, Mask-pose Cascaded CNN for 2d Hand Pose Estimation from Single Color Image)
efficientnet_b0: EfficientNet-B0 pretrained on ImageNet
efficientnet_b1: EfficientNet-B1 pretrained on ImageNet
efficientnet_b2: EfficientNet-B2 pretrained on ImageNet
vits_dino: Vision Transformer (Small) pretrained on ImageNet with DINO
vits_dinov2: Vision Transformer (Small) pretrained on ImageNet with DINOv2
vits_dinov3: Vision Transformer (Small) pretrained on ImageNet with DINOv3
vitb_dino: Vision Transformer (Base) pretrained on ImageNet with DINO
vitb_dinov2: Vision Transformer (Base) pretrained on ImageNet with DINOv2
vitb_dinov3: Vision Transformer (Base) pretrained on ImageNet with DINOv3; note this is a gated repo and you will need a Hugging Face account
vitb_imagenet: Vision Transformer (Base) pretrained on ImageNet with MAE loss
vitb_sam: Segment Anything Model (Vision Transformer Base)
Note: the file size for a single ResNet-50 network is approximately 275 MB.
model.model_type (str, default: heatmap):
regression: model directly outputs an (x, y) prediction for each keypoint; not recommended
heatmap: model outputs a 2D heatmap for each keypoint
heatmap_mhcrnn: the “multi-head convolutional RNN”; this model takes a temporal window of frames as input and outputs two heatmaps: one “context-aware” and one “static”. The prediction with the highest confidence is automatically chosen. See the Temporal Context Network page for more information.
heatmap_multiview_transformer: see multi-view docs for more information.
model.heatmap_loss_type (str, default: mse): (experimental) loss to compute difference between ground truth and predicted heatmaps
model.model_name (str, default: test): directory name for model saving
model.checkpoint (str or null, default: null): to initialize weights from an existing checkpoint, update this parameter to the absolute path of a pytorch .ckpt file
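To make the relationship between model.losses_to_use and the kind of model concrete, here is a small sketch that summarizes a model section held as a Python dict. The function name `summarize_model` is hypothetical. Because the list of losses above is introduced with "options include", the sketch only reports names it does not recognize instead of rejecting them.

```python
# Loss names and model types listed in this section.
LISTED_LOSSES = {"pca_multiview", "pca_singleview", "temporal"}
LISTED_MODEL_TYPES = {
    "regression",
    "heatmap",
    "heatmap_mhcrnn",
    "heatmap_multiview_transformer",
}


def summarize_model(model_cfg: dict) -> dict:
    """Summarize supervision mode and flag values not listed in the docs."""
    losses = model_cfg.get("losses_to_use", [])  # documented default: []
    model_type = model_cfg.get("model_type", "heatmap")  # documented default
    if model_type not in LISTED_MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    return {
        "model_type": model_type,
        # An empty list of losses means a fully supervised model.
        "supervision": "semi-supervised" if losses else "fully supervised",
        "unlisted_losses": [name for name in losses if name not in LISTED_LOSSES],
    }


# Hypothetical model sections for illustration.
print(summarize_model({"losses_to_use": []}))
print(summarize_model({"losses_to_use": ["pca_multiview", "temporal"],
                       "model_type": "heatmap_mhcrnn"}))
```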
Video loading parameters
Some parameters relate to video loading, both for semi-supervised models and when predicting new videos with any of the models. The parameters may need modification depending on the size of the data and the available compute resources. See the FAQs for more information on memory management.
dali.base.train.sequence_length (int, default: 32): number of unlabeled frames per batch in “regression” and “heatmap” models (i.e. “base” models that do not use temporal context frames)
dali.base.predict.sequence_length (int, default: 96): batch size when predicting on a new video with a base model
dali.context.train.batch_size (int, default: 16): number of unlabeled frames per batch in heatmap_mhcrnn model (i.e. “context” models that utilize temporal context frames)
dali.context.predict.sequence_length (int, default: 96): batch size when predicting on a new video with a “context” model
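Which of the four values applies depends on the model type. The sketch below maps a model type to the relevant keys and reads them from a nested dict filled with the defaults listed above. The helpers `dali_keys_for` and `get_dotted` are hypothetical, and heatmap_multiview_transformer is deliberately left out because this page does not say which group it belongs to.

```python
from functools import reduce

BASE_MODEL_TYPES = {"regression", "heatmap"}  # no temporal context frames
CONTEXT_MODEL_TYPES = {"heatmap_mhcrnn"}  # uses temporal context frames


def dali_keys_for(model_type: str) -> dict:
    """Return the config keys that control video batch sizes for a model type."""
    if model_type in BASE_MODEL_TYPES:
        return {
            "train": "dali.base.train.sequence_length",
            "predict": "dali.base.predict.sequence_length",
        }
    if model_type in CONTEXT_MODEL_TYPES:
        return {
            "train": "dali.context.train.batch_size",
            "predict": "dali.context.predict.sequence_length",
        }
    raise ValueError(f"model type not covered by this sketch: {model_type!r}")


def get_dotted(cfg: dict, dotted_key: str):
    """Look up a key such as 'dali.base.train.sequence_length' in a nested dict."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), cfg)


# Config fragment populated with the defaults listed above.
cfg = {
    "dali": {
        "base": {"train": {"sequence_length": 32}, "predict": {"sequence_length": 96}},
        "context": {"train": {"batch_size": 16}, "predict": {"sequence_length": 96}},
    }
}

for model_type in ("heatmap", "heatmap_mhcrnn"):
    keys = dali_keys_for(model_type)
    print(model_type, {stage: get_dotted(cfg, key) for stage, key in keys.items()})
```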
Evaluation
The following parameters are used for general evaluation.
eval.predict_vids_after_training (bool, default: true): if true, after training run inference on all videos located in eval.test_videos_directory (see below)
eval.test_videos_directory (str, default: null): absolute path to a video directory containing videos for post-training prediction.
eval.save_vids_after_training (bool, default: false): save out an mp4 file with predictions overlaid after running post-training prediction.
eval.colormap (str, default: cool): colormap options for labeled videos; options include sequential colormaps (viridis, plasma, magma, inferno, cool, etc) and diverging colormaps (RdBu, coolwarm, Spectral, etc)
eval.confidence_thresh_for_vid (float, default: 0.9): predictions with confidence below this value will not be plotted in the labeled videos
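The effect of eval.confidence_thresh_for_vid can be illustrated with a simple filter. This is a sketch of the rule as stated above, not the plotting code used by Lightning Pose; the function name and the prediction records are made up for the example.

```python
def keypoints_to_plot(predictions: list, confidence_thresh: float = 0.9) -> list:
    """Drop predictions whose confidence is below the threshold."""
    return [p for p in predictions if p["confidence"] >= confidence_thresh]


# Hypothetical predictions for a single frame.
frame_predictions = [
    {"keypoint": "paw_left", "x": 10.0, "y": 20.0, "confidence": 0.95},
    {"keypoint": "paw_right", "x": 30.0, "y": 40.0, "confidence": 0.42},
    {"keypoint": "nose", "x": 50.0, "y": 60.0, "confidence": 0.90},
]

# Only paw_right falls below the default threshold of 0.9.
visible = keypoints_to_plot(frame_predictions)
print([p["keypoint"] for p in visible])
```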
Additional Metadata
The following parameters are added to the config by Lightning Pose on model creation:
creation_datetime: (string) An ISO datetime string of the model creation datetime.
model.lightning_pose_version: (string) The lightning-pose package version number that created this model.
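Since creation_datetime is an ISO datetime string, it can be parsed with the Python standard library. The sketch below assumes a plain "YYYY-MM-DDTHH:MM:SS" form; the exact format written by Lightning Pose is not specified here, and both metadata values are placeholders.

```python
from datetime import datetime

# Placeholder metadata; real values are written when the model is created.
cfg = {
    "creation_datetime": "2024-05-01T12:30:00",
    "model": {"lightning_pose_version": "0.0.0"},
}

created = datetime.fromisoformat(cfg["creation_datetime"])
print(created.date(), cfg["model"]["lightning_pose_version"])
```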