.. _multiview_separate:

################################
Multiview: separate data streams
################################

In addition to the mirrored setups discussed on the previous page, Lightning Pose also supports
more traditional multiview data, where the same scene is captured from different angles with
different cameras.
We offer a multi-view transformer solution that processes all views simultaneously,
learning cross-view correlations to improve performance.
Similar to the single view setup, Lightning Pose produces a separate csv file with the predicted
keypoints for each video.

.. note::

    As of July 2024, the non-mirrored multiview feature of Lightning Pose now supports
    context frames and some unsupervised losses.
    The Multiview PCA loss operates across all views, while the temporal loss operates on
    single views.
    The Pose PCA loss is not yet implemented for the multiview case.

Organizing your data
====================

As an example, let’s assume a dataset has two camera views from a given session ("session0"),
which we’ll call “view0” and “view1”.
Lightning Pose assumes the following project directory structure:

.. code-block::

    /path/to/project/
      ├── /
      │   ├── session0_view0/
      │   └── session0_view1/
      ├── /
      │   ├── session0_view0.mp4
      │   └── session0_view1.mp4
      ├── view0.csv
      └── view1.csv

* ``/``: The directory name, any subdirectory names, and image names are all flexible, as long as
  they are consistent with the first column of the ``.csv`` files (see below).
  As an example, each session/view pair can have its own subdirectory, which contains images
  that correspond to the labels.
  The same frames from all the views must have the same names; for example, the images
  corresponding to time point 39 should be named "/session0_view0/img000039.png" and
  "/session0_view1/img000039.png".

* ``/``: This is a single directory of videos, which **must** follow the naming convention
  ``_.csv``.
  So in our example there should be two videos, named ``session0_view0.mp4`` and
  ``session0_view1.mp4``.

* ``.csv``: For each view (camera) there should be a table with keypoint labels
  (rows: frames; columns: keypoints).
  Note that these files can take any name, and need to be listed in the config file under the
  ``data.csv_file`` section.
  Each csv file must contain the same set of keypoints, and each must have the same number of
  rows (corresponding to specific points in time).

The configuration file
======================

Like the single view case, users interact with Lightning Pose through a single configuration file.
This file points to data directories, defines the type of models to fit, and specifies a wide
range of hyperparameters.

A template file can be found `here `_.
When training a model on a new dataset, you must copy/paste this template onto your local machine
and update the arguments to match your data.

To switch to multiview from single view you need to change two data parameters.
Again, assume that we are working with the two-view dataset used as an example above:

.. code-block:: yaml

    data:
      csv_file:
        - view0.csv
        - view1.csv
      view_names:
        - view0
        - view1
      mirrored_column_matches: [see bullet below]
      columns_for_singleview_pca: [see bullet below]

* ``csv_file``: list of csv filenames, one for each view
* ``view_names``: list of view names
* ``mirrored_column_matches``: if you would like to use the Multiview PCA loss, you must ensure
  the following:
  (1) the same set of keypoints is labeled across all views (though there can be missing data);
  (2) this config field should be a list of the indices corresponding to a *single view* which
  are included in the loss for all views; for example, if you have 10 keypoints in each view and
  you want to include the zeroth, first, and fifth in the Multiview PCA loss, this field should
  look like ``mirrored_column_matches: [0, 1, 5]``;
  (3) as in the non-multiview case, you must specify you want to use this loss
  :ref:`elsewhere in the config file `.
* ``columns_for_singleview_pca``: NOT YET IMPLEMENTED

To utilize the multi-view transformer, modify the following entries:

.. code-block:: yaml

    model:
      backbone: vits_dino
      model_type: heatmap_multiview_transformer

The backbone can be any of the available backbones that start with the string "vit",
indicating a Vision Transformer.
The ``heatmap_multiview_transformer`` model type then uses the specified backbone to process all
camera views simultaneously.

Patch masking
=============

The self-attention of the multi-view transformer (MVT) enables the network to utilize information
from multiple views, which is particularly advantageous for handling occlusions.
To encourage the model to develop this cross-view reasoning during training, we introduce a pixel
space patch masking scheme inspired by the success of masked autoencoders and dropout.
We use a training curriculum that starts with a short warmup period where no patches are masked
(controlled by ``training.patch_mask.init_step`` in the config file), then increases the ratio of
masked patches over the course of training (controlled by ``training.patch_mask.init_ratio`` and
``training.patch_mask.final_ratio``).
This technique creates gradients that flow through the attention mechanism and encourage
cross-view information propagation, which in turn develops internal representations that capture
statistical relationships between the different views.

.. code-block:: yaml

    training:
      patch_mask:
        init_step: 0      # step to start patch masking
        final_step: 5000  # step when patch masking reaches maximum
        init_ratio: 0.0   # initial masking ratio
        final_ratio: 0.5  # final masking ratio

To turn patch masking off, set ``final_ratio: 0.0``.

3D augmentations and loss
=========================

.. code-block:: yaml

    training:
      imgaug: dlc
      imgaug_3d: true

    losses:
      supervised_pairwise_projections:
        log_weight: 0.5

Training and inference
======================

Once the data are properly organized and the config files updated,
:ref:`training ` and :ref:`inference ` in this multiview setup proceed exactly the same as for
the single view case.
Because the trained network is view-agnostic, during inference videos are processed and saved one
view at a time.
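
As a rough illustration of how the four ``training.patch_mask`` settings described in the patch
masking section interact, the sketch below computes a masking ratio for a given training step.
It assumes a linear ramp between ``init_step`` and ``final_step``; the shape of the ramp is not
specified on this page, and the function name ``patch_mask_ratio`` is hypothetical rather than
part of the Lightning Pose API.

.. code-block:: python

    def patch_mask_ratio(step, init_step=0, final_step=5000,
                         init_ratio=0.0, final_ratio=0.5):
        """Illustrative masking-ratio schedule (assumed linear, not the library's code)."""
        # Warmup period: no patches are masked before init_step.
        if step < init_step:
            return 0.0
        # After final_step (or if the ramp has zero length) use the maximum ratio.
        if final_step <= init_step or step >= final_step:
            return final_ratio
        # Assumption: interpolate linearly between init_ratio and final_ratio.
        progress = (step - init_step) / (final_step - init_step)
        return init_ratio + progress * (final_ratio - init_ratio)


    if __name__ == "__main__":
        for step in (0, 1250, 2500, 5000, 10000):
            print(step, patch_mask_ratio(step))

With the default values shown in the config above, the ratio under this assumption is 0.25
halfway through the ramp and stays at 0.5 from step 5000 onward. Setting ``final_ratio`` and
``init_ratio`` to 0.0 keeps the ratio at zero for every step, which matches the instruction for
turning patch masking off.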