Self-Supervised Learning of Motion-Informed Latents

29 Sep 2021 · Raphaël Jean, Pierre-Luc St-Charles, Soren Pirk, Simon Brodeur

Siamese network architectures trained for self-supervised instance recognition can learn powerful visual representations that are useful in various tasks. Many such approaches work by simply maximizing the similarity between representations of augmented images of the same object. In this paper, we further expand on the success of these methods by studying an unusual training scheme for learning motion-informed representations. Our goal is to show that common Siamese networks can effectively be trained on video sequences to disentangle attributes related to pose and motion that are useful for video and non-video tasks, yet are typically suppressed in usual training schemes. Unlike parallel efforts that focus on introducing new image-space operators for data augmentation, we argue that extending the augmentation strategy by using different frames of a video leads to more powerful representations. To show the effectiveness of this approach, we use the Objectron and UCF101 datasets to learn representations and evaluate them on pose estimation, action recognition, and object re-identification. We show that self-supervised learning using in-domain video sequences yields better results on these tasks than fine-tuning networks pre-trained on still images. Furthermore, we carefully validate our method against a number of baselines.
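
To make the training scheme concrete, here is a minimal sketch of how a temporal positive pair could be formed and scored. It assumes PyTorch and a SimSiam-style objective; the function names, the max_gap parameter, and the exact loss form are illustrative assumptions rather than details taken from the paper.

    import torch
    import torch.nn.functional as F

    def sample_frame_pair(video, max_gap=8):
        """Pick two nearby frames from one clip as a positive pair.

        `video` is a (T, C, H, W) tensor. The temporal offset between
        the two frames acts as a natural augmentation on top of the
        usual image-space ones (crop, color jitter, etc.).
        """
        t = video.shape[0]
        i = torch.randint(0, t, (1,)).item()
        j = min(t - 1, i + torch.randint(1, max_gap + 1, (1,)).item())
        return video[i], video[j]

    def simsiam_loss(p1, p2, z1, z2):
        """Symmetrized negative cosine similarity (SimSiam-style).

        p1, p2 are predictor outputs; z1, z2 are projector outputs,
        with gradients stopped via detach() as in SimSiam.
        """
        return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                 + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2

In this sketch, the two sampled frames would each pass through the same shared backbone, projector, and predictor head, exactly as two augmented crops of a still image would in a standard Siamese setup; only the pair-sampling step changes.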
