Self-supervised visual learning from interactions with objects
Arthur Aubret, Céline Teulière, Jochen Triesch
TL;DR
This work tackles the robustness gap in self-supervised visual learning by introducing Action-Aware Self-Supervised Learning (AA-SSL), which couples image embeddings with action representations derived from object interactions. By training on triplets $(x_t, x_{t'}, a_{t,t'})$ and aligning an action embedding with the joint visual representation of the two views, AA-SSL improves category recognition and better aligns similar viewpoints of different objects within a category. The method shows consistent gains across large-scale datasets (RT4K, CO3D, MVImgNet variants) and improves transfer learning to ImageNet classification tasks, while being more robust to variations in data augmentation and projection-head architecture. The results suggest that embodied interactions augment SSL by balancing viewpoint invariance with viewpoint sensitivity, enabling more semantic, generalizable object representations with potential applications in data-efficient vision and robotics.
Abstract
Self-supervised learning (SSL) has revolutionized visual representation learning, but it has not achieved the robustness of human vision. One reason could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn it or move around it, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. To this end, we extract the actions performed to change from one egocentric view of an object to another in four video datasets. We then introduce a new loss function that learns visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This allows the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. In our analysis, we find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
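To make the idea of aligning an action with a pair of views concrete, the following is a minimal sketch and not the loss used in this work. It assumes an InfoNCE-style contrastive objective, in which the concatenated embeddings of two views are projected and matched against the projected action within a batch. The function names, the linear projections W_v and W_a, and the temperature value are all hypothetical choices made for illustration; the exact form of the loss is not specified in the abstract.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def action_alignment_loss(z_t, z_tp, a, W_v, W_a, temperature=0.1):
    """Illustrative InfoNCE-style alignment loss (an assumption, not the paper's loss).

    z_t, z_tp : (B, D) visual embeddings of two views from the same clip
    a         : (B, A) representation of the action linking the two views
    W_v       : (2D, K) hypothetical linear projection of the joint visual pair
    W_a       : (A, K)  hypothetical linear projection of the action
    """
    # Joint visual representation of the two views.
    pair = np.concatenate([z_t, z_tp], axis=1)        # (B, 2D)
    u = l2_normalize(pair @ W_v)                      # (B, K)
    v = l2_normalize(a @ W_a)                         # (B, K)

    # Cosine similarities between every visual pair and every action in the batch.
    logits = (u @ v.T) / temperature                  # (B, B)

    # Numerically stable log-softmax over the actions (rows = visual pairs).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching action for pair i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, the other actions in the batch act as negatives, so a visual pair is pulled toward the action that actually produced it and pushed away from the rest. In practice the two linear maps would be replaced by learned networks trained jointly with the image encoder.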
