Cropzoom pipeline
For setups where an animal is freely moving in a large arena, it’s advantageous to crop around the animal before running pose estimation. Lightning Pose calls this technique “cropzoom”. This document describes how to set up such a pipeline.
Tip
This pipeline works equally well with bounding boxes from any external source
(e.g., idtracker.ai,
SAM3, or your own custom detection scripts).
Wherever the examples below call litpose predict [detector_model] and
litpose create_bbox [detector_model], substitute your own scripts that produce
bbox CSV files in the required format (see Using bboxes from an external source
below). All downstream steps are identical.
Conceptual overview
A cropzoom pipeline consists of two Lightning Pose models: a “detector model” and a “pose model”.
The detector model operates on the full image of the arena.
The pose model operates on the cropped animal.
These two models are trained and used for prediction like any other Lightning Pose model. We provide additional tools that help you compose them:
litpose create_bbox: Given the detector model’s predictions, computes per-frame bounding boxes and saves them as CSV files.
litpose smooth_bbox: (optional) Applies temporal smoothing to bbox CSV files, which can reduce jitter in the cropped region.
litpose crop: Given a directory of bbox CSV files, crops the animal out of each frame or video.
litpose remap: Given the pose model’s predictions and the crop bounding boxes, remaps the predictions to the original coordinate space.
Alternatively, litpose predict --bbox_dir combines the crop, predict, and remap steps
into a single command — see Prediction on videos for details.
For the full command-line reference for these tools, see the CLI page sections: Create bbox, Smooth bbox, Crop, and Remap.
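To make the relationship between cropping and remapping concrete, here is a minimal Python sketch of the coordinate arithmetic. It is not the litpose remap implementation. It assumes the x, y, h, w bbox convention described later in this document (top-left corner and size in pixels) and allows for the crop having been resized before the pose model saw it. The function name and its arguments are hypothetical.

```python
def remap_keypoint(x_crop, y_crop, bbox, crop_w, crop_h):
    """Map a keypoint from cropped-image pixels back to full-frame pixels.

    bbox is a dict with keys x, y, h, w (top-left corner and size in pixels).
    crop_w and crop_h are the pixel dimensions of the cropped image that the
    pose model predicted on (assumption: the crop may have been resized).
    """
    scale_x = bbox["w"] / crop_w
    scale_y = bbox["h"] / crop_h
    # Undo the resize, then shift by the bbox's top-left corner.
    return bbox["x"] + x_crop * scale_x, bbox["y"] + y_crop * scale_y


# A 200x200 bbox whose top-left corner is at (300, 120), stored as a 100x100 crop.
bbox = {"x": 300, "y": 120, "h": 200, "w": 200}
x_full, y_full = remap_keypoint(50, 25, bbox, crop_w=100, crop_h=100)
print(x_full, y_full)  # 400.0 170.0
```

If the crop is not resized, both scale factors are 1 and remapping reduces to adding the bbox's top-left corner.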
Training
Training involves:
1. Train a “detector model”.
2. Predict on training data using the detector model.
3. Create bounding boxes from detector predictions.
4. Crop training data for the pose model.
5. Train a “pose model”.
Inference
Inference involves:
1. Predict using the “detector model”.
2. Create bounding boxes from detector predictions.
3. (optional) Smooth the bounding boxes.
4. Crop the data using the bounding boxes.
5. Predict on the cropped data using the “pose model”.
6. Remap the pose model’s predictions to the original coordinate space.
Note
Steps 4–6 can be replaced by a single litpose predict --bbox_dir call —
see Prediction on videos in the example below.
Bounding box sizing
The litpose create_bbox command supports two ways to size the bounding box around the
animal. The two options are mutually exclusive; if neither is provided,
--crop_ratio=2.0 is used.
--crop_ratio (default)
Sizes the bounding box relative to the per-frame span of the detected keypoints. A value of 2.0 (the default) produces a box twice as wide and tall as the spread of keypoints. Use this when the animal’s apparent size varies across frames.
litpose create_bbox $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4 --crop_ratio 2.0
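As a rough illustration of ratio-based sizing, the following Python sketch computes one frame's box from a list of keypoints. It follows the description above (width and height are the keypoint span multiplied by the ratio). Centring the box on the midpoint of the keypoint extent is an assumption, and the function name is hypothetical; the actual litpose create_bbox code may differ, for example by clipping to the frame or rounding.

```python
def bbox_from_ratio(keypoints, crop_ratio=2.0):
    """Return (x, y, h, w) for one frame from a list of (x, y) keypoints."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    # Box size is the per-frame keypoint span scaled by crop_ratio.
    w = crop_ratio * (max(xs) - min(xs))
    h = crop_ratio * (max(ys) - min(ys))
    # Assumption: the box is centred on the midpoint of the keypoint extent.
    cx = (max(xs) + min(xs)) / 2
    cy = (max(ys) + min(ys)) / 2
    return cx - w / 2, cy - h / 2, h, w


keypoints = [(100, 200), (140, 260), (120, 220)]
print(bbox_from_ratio(keypoints))  # (80.0, 170.0, 120.0, 80.0)
```

Because the span is recomputed for every frame, the box grows and shrinks with the animal's apparent size.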
--crop_size
Produces a fixed square bounding box of the given pixel size, centred on the per-frame mean of the detected keypoints. Use this when the animal occupies a roughly consistent region of the frame and you want uniform crop dimensions.
litpose create_bbox $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4 --crop_size 200
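For comparison, fixed-size sizing can be sketched in Python as follows. The box is a square of crop_size pixels centred on the mean keypoint position, as described above. The function name is hypothetical, and the sketch neither clips the box to the frame nor rounds coordinates, which the real command may do.

```python
def bbox_from_size(keypoints, crop_size):
    """Return (x, y, h, w) for a fixed square box centred on the keypoint mean."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    half = crop_size / 2
    return cx - half, cy - half, crop_size, crop_size


keypoints = [(100, 200), (140, 260), (120, 230)]
print(bbox_from_size(keypoints, 200))  # (20.0, 130.0, 200, 200)
```

Every frame gets the same height and width; only the top-left corner moves with the animal.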
Bounding box smoothing
After running litpose create_bbox, you can optionally smooth the resulting bbox files
with litpose smooth_bbox. Smoothing reduces per-frame jitter in the crop region,
which can improve pose estimation quality when the detector produces noisy predictions.
Smoothed bboxes are written to a new directory alongside a metadata.json file
that records the smoothing parameters. You can then pass this directory to
litpose crop or litpose predict via --bbox_dir.
litpose smooth_bbox $MODEL_DIR/$DETECTOR_MODEL/video_preds \
--output_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds/bboxes_smooth_w5 \
--window 5
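To show what temporal smoothing does to a bbox track, here is a minimal Python sketch that applies a centred moving average to each bbox column. The filter used by litpose smooth_bbox is not specified in this document, so the moving average is an assumption chosen for illustration, and both function names are hypothetical. The sketch expects an odd window.

```python
def smooth_column(values, window=5):
    """Centred moving average; the window shrinks at the start and end."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def smooth_bboxes(rows, window=5):
    """Smooth x, y, h and w independently across frames.

    rows is a list of dicts with keys x, y, h, w, one per frame.
    """
    keys = ("x", "y", "h", "w")
    columns = {k: smooth_column([r[k] for r in rows], window) for k in keys}
    return [{k: columns[k][i] for k in keys} for i in range(len(rows))]


# A single-frame jump in x is spread over its neighbours.
print(smooth_column([10, 10, 40, 10, 10], window=3))
# [10.0, 20.0, 20.0, 20.0, 10.0]
```

A larger window suppresses more jitter but also makes the crop lag behind fast movements.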
Using bboxes from an external source
Because litpose crop and litpose predict both accept a --bbox_dir argument,
you can use bounding boxes produced by any external tool, not just litpose create_bbox
(e.g., idtracker.ai,
SAM3, etc.).
Place your bbox CSV files in a directory following the naming convention below, then pass
--bbox_dir to either command:
Videos: <video_stem>_bbox.csv (one file per video)
Labeled frames: bbox.csv
Each bbox CSV must have columns x, y, h, w (top-left corner and size in
pixels), with one row per frame.
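If your boxes come from your own detection script, a file in this layout can be written with the Python standard library. The sketch below is one way to do it under the format stated above (columns x, y, h, w, one row per frame, named after the video stem). The helper name is hypothetical, and whether litpose expects anything further, such as an index column, is not stated here, so compare the result against a file produced by litpose create_bbox before relying on it.

```python
import csv
import tempfile
from pathlib import Path


def write_bbox_csv(bbox_dir, video_path, boxes):
    """Write <video_stem>_bbox.csv with one (x, y, h, w) row per frame.

    boxes is a list of (x, y, h, w) tuples in frame order.
    """
    out_path = Path(bbox_dir) / f"{Path(video_path).stem}_bbox.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y", "h", "w"])
        writer.writerows(boxes)
    return out_path


# Demonstration in a temporary directory; use your real bbox directory instead.
with tempfile.TemporaryDirectory() as tmp:
    path = write_bbox_csv(tmp, "data/videos/test_vid.mp4",
                          [(80, 170, 120, 80), (82, 171, 120, 80)])
    print(path.name)  # test_vid_bbox.csv
    print(path.read_text())
```

The directory passed as bbox_dir is the one you would then give to --bbox_dir.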
Example
This is a basic example of how to set up a cropzoom pipeline. Replace the paths to CSV and MP4 files below with your own. The example is illustrative only; in practice you may want to modify it, for example by:
Using a different model type, backbone, or image_resize_dims for your detector model and pose model. This can be accomplished by using a separate config file for each model.
Limiting train_frames and max_epochs for testing purposes.
Choosing --crop_ratio or --crop_size to suit your data (see Bounding box sizing above).
We’ll use some bash variables to avoid repeating paths below:
MODEL_DIR=outputs/chickadee/cropzoom
DETECTOR_MODEL=detector_0
POSE_MODEL=pose_supervised_0
Training script
#!/bin/bash
# Train the detector model.
litpose train config.yaml --output_dir $MODEL_DIR/$DETECTOR_MODEL
# Predict on training data with the detector model.
litpose predict $MODEL_DIR/$DETECTOR_MODEL data/CollectedData.csv
# Create bounding boxes from detector predictions.
# Use --crop_ratio (default) or --crop_size to control bounding box size.
litpose create_bbox $MODEL_DIR/$DETECTOR_MODEL data/CollectedData.csv
# Crop images for pose model training.
litpose crop $MODEL_DIR/$DETECTOR_MODEL data/CollectedData.csv
# Train the pose model.
litpose train config.yaml --output_dir $MODEL_DIR/$POSE_MODEL \
--detector_model $MODEL_DIR/$DETECTOR_MODEL
For command-line options of the commands used above, see Create bbox and Crop.
Prediction on videos
The step-by-step approach runs crop, predict, and remap as separate commands:
Pros
Intermediate cropped videos are stored on disk, making it straightforward to inspect each stage of the pipeline and diagnose potential issues in the detector or pose estimator.
Cons
Cropped videos are written to disk, requiring additional storage and compute.
An extra remap step is needed to convert predictions back to the original coordinate space.
#!/bin/bash
litpose predict $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4
litpose create_bbox $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4
# Optional: smooth bboxes before cropping.
litpose smooth_bbox $MODEL_DIR/$DETECTOR_MODEL/video_preds \
--output_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds/bboxes_smooth
# Crop using raw bboxes (default):
litpose crop $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4
# Or crop using smoothed bboxes:
litpose crop $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4 \
--bbox_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds/bboxes_smooth
litpose predict $MODEL_DIR/$POSE_MODEL \
$MODEL_DIR/$DETECTOR_MODEL/cropped_videos/cropped_test_vid.mp4
litpose remap $MODEL_DIR/$POSE_MODEL/video_preds/cropped_test_vid.csv \
$MODEL_DIR/$DETECTOR_MODEL/video_preds/test_vid_bbox.csv
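When the same sequence has to be run for many videos, it can help to generate the commands programmatically. The Python sketch below only builds the argument lists for the script above (without the optional smoothing step) and prints them. The output path patterns (cropped_videos/cropped_<stem>.mp4, video_preds/cropped_<stem>.csv, video_preds/<stem>_bbox.csv) are inferred from that single example and should be treated as assumptions, and the helper name is hypothetical.

```python
from pathlib import Path


def cropzoom_commands(model_dir, detector, pose, video):
    """Return argv lists for the step-by-step video pipeline, in order."""
    det = f"{model_dir}/{detector}"
    pose_dir = f"{model_dir}/{pose}"
    stem = Path(video).stem
    return [
        ["litpose", "predict", det, video],
        ["litpose", "create_bbox", det, video],
        ["litpose", "crop", det, video],
        # Assumed path patterns, copied from the example script above.
        ["litpose", "predict", pose_dir,
         f"{det}/cropped_videos/cropped_{stem}.mp4"],
        ["litpose", "remap",
         f"{pose_dir}/video_preds/cropped_{stem}.csv",
         f"{det}/video_preds/{stem}_bbox.csv"],
    ]


commands = cropzoom_commands(
    "outputs/chickadee/cropzoom", "detector_0", "pose_supervised_0",
    "data/videos/test_vid.mp4",
)
for cmd in commands:
    # Printed only; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Keeping the commands as lists rather than one shell string avoids quoting problems when video paths contain spaces.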
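Alternatively, pass --bbox_dir to litpose predict to combine the crop, predict, and remap steps into a single command:
Pros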
Runs entirely in memory — no intermediate cropped files are written to disk.
Predictions are returned in the original coordinate space; no remap step is needed.
Cons
Without intermediate cropped files, it is harder to inspect the detector output or diagnose issues at each stage of the pipeline.
#!/bin/bash
litpose predict $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4
litpose create_bbox $MODEL_DIR/$DETECTOR_MODEL data/videos/test_vid.mp4
# Optional: smooth bboxes.
litpose smooth_bbox $MODEL_DIR/$DETECTOR_MODEL/video_preds \
--output_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds/bboxes_smooth
# Predict using raw bboxes:
litpose predict $MODEL_DIR/$POSE_MODEL data/videos/test_vid.mp4 \
--bbox_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds
# Or predict using smoothed bboxes:
litpose predict $MODEL_DIR/$POSE_MODEL data/videos/test_vid.mp4 \
--bbox_dir $MODEL_DIR/$DETECTOR_MODEL/video_preds/bboxes_smooth
For detailed command-line options, see Create bbox, Smooth bbox, Crop, and Remap.
Limitations
Pose models do not yet support PCA Multiview loss.