Changing inference batch size
Sometimes model training appears to fail due to a CUDA out of memory error:
Can't allocate 2081423360 bytes on device 0.
Current pipeline object is no longer valid.
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA driver API error CUDA_ERROR_INVALID_VALUE (1):
invalid argument
If this happens during training, you can reduce the batch size and try again. This is a setting in the model creation UI.
However, if you look further up in the logs and see these lines:
------------------------------------------------------
Predicting videos in cfg.eval.test_videos_directory...
------------------------------------------------------
It means the error did not occur during training, but during the inference that runs automatically after training. In this case, you need to reduce the inference batch size. The fix is currently manual and is the topic of this doc.
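The check described above can be sketched as a small helper. This is only an illustration: it assumes the log is available as a single string, and that the out-of-memory message contains the text "Can't allocate" as in the excerpt above. The function name `failure_stage` is hypothetical and not part of any tool mentioned here.

```python
# Marker copied from the log excerpt above; it appears when post-training inference starts.
INFERENCE_MARKER = "Predicting videos in cfg.eval.test_videos_directory..."
# Substring of the allocation failure shown in the log excerpt above (assumed stable).
OOM_MARKER = "Can't allocate"


def failure_stage(log_text: str) -> str:
    """Return 'inference', 'training', or 'no-oom' for the given log text."""
    oom_pos = log_text.find(OOM_MARKER)
    if oom_pos == -1:
        return "no-oom"
    marker_pos = log_text.find(INFERENCE_MARKER)
    # The failure belongs to inference only if the marker was printed before it.
    if marker_pos != -1 and marker_pos < oom_pos:
        return "inference"
    return "training"
```

If the helper returns "inference", the training batch size is not the problem and the inference batch size is the setting to change.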
For an existing model
Locate the model's directory and its config.yaml file (DATA_DIR/models/MODEL_NAME/config.yaml).
Open it in a text editor and change the value of dali -> base -> predict -> sequence_length from 96 to 32 (this value sets the inference batch size).
To re-run inference on all uploaded videos without re-creating the model, you can run in the CLI:
litpose predict DATA_DIR/models/MODEL_NAME DATA_DIR/videos
or run inference on the same videos from the UI.
The error logs in the UI won't update. This only patches the model whose config you updated, not new models.
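For anyone who prefers to script the config edit rather than use a text editor, the change amounts to updating one nested key. The sketch below works on a plain dictionary and assumes the YAML file has already been loaded into one (for example with a YAML library, which is not shown here). The helper name `set_nested` and the example dictionary are hypothetical.

```python
from typing import Any


def set_nested(config: dict, keys: list[str], value: Any) -> Any:
    """Set the value at the nested key path and return the previous value.

    Raises KeyError if any key along the path is missing, so a typo in the
    path cannot silently create a new, unused setting.
    """
    node = config
    for key in keys[:-1]:
        node = node[key]
    previous = node[keys[-1]]
    node[keys[-1]] = value
    return previous


# Hypothetical stand-in for the relevant part of a loaded config.yaml.
config = {"dali": {"base": {"predict": {"sequence_length": 96}}}}
old = set_nested(config, ["dali", "base", "predict", "sequence_length"], 32)
print(old, "->", config["dali"]["base"]["predict"]["sequence_length"])  # prints: 96 -> 32
```

Failing on a missing key is a deliberate choice here: a misspelled path would otherwise add a setting that nothing reads, and the out-of-memory error would persist.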
For new models going forward
When you create a new model, the config.yaml file is generated by applying settings from the UI on top of the base config files stored in
If you place a file at DATA_DIR/configs/default.yaml, it gets used instead of the default files above.
So you can download the file from the link above, place it at that location, and override the inference batch size there.
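To see why a setting in the base file carries over to every new model, it helps to picture the UI settings being layered on top of the base config. The sketch below is a generic recursive merge and is not the application's actual implementation. The function `apply_overrides`, the `training` section, and its `train_batch_size` key are hypothetical placeholders for illustration.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = apply_overrides(merged[key], value)
        else:
            # Otherwise the override replaces the base value outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical custom base config with a reduced inference batch size.
custom_base = {
    "dali": {"base": {"predict": {"sequence_length": 32}}},
    "training": {"train_batch_size": 16},
}
# Hypothetical settings chosen in the UI; they do not touch the dali section.
ui_settings = {"training": {"train_batch_size": 8}}

new_model_config = apply_overrides(custom_base, ui_settings)
```

Because the UI settings in this sketch never mention the dali section, the reduced sequence_length from the custom base file survives unchanged in the generated config.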
Note
See the full documentation for the Model configuration file.