Contrastive Video Representation Learning

Introduced by Qian et al. in Spatiotemporal Contrastive Video Representation Learning

Contrastive Video Representation Learning, or CVRL, is a self-supervised contrastive learning framework for learning spatiotemporal visual representations from unlabeled videos. Representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. Data augmentations are designed involving spatial and temporal cues. Concretely, a temporally consistent spatial augmentation method is used to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. A sampling-based temporal augmentation method is also used to avoid overly enforcing invariance on clips that are distant in time.

End-to-end, from a raw video, we first sample a temporal interval from a monotonically decreasing distribution. The temporal interval represents the number of frames between the start points of two clips, and we sample two clips from a video according to this interval. Afterwards we apply a temporally consistent spatial augmentation to each of the clips and feed them into a 3D backbone with an MLP head. The contrastive loss is used to train the network to attract the clips from the same video and repel the clips from different videos in the embedding space.

Source: Spatiotemporal Contrastive Video Representation Learning

Read Paper See Code