Self-supervised visual learning from interactions with objects

Arthur Aubret, Céline Teulière, Jochen Triesch

TL;DR

This work tackles the robustness gap in self-supervised visual learning by introducing Action-Aware Self-Supervised Learning (AA-SSL), which couples image embeddings with action representations derived from object interactions. By training on triplets $(x_t, x_{t'}, a_{t,t'})$ and aligning an action embedding with the joint visual representation of adjacent views, AA-SSL achieves improved category recognition and better alignment of similar viewpoints across different objects within a category. The method demonstrates consistent gains across large-scale datasets (RT4K, CO3D, MVImgNet variants) and yields transfer-learning improvements to ImageNet-classification tasks, while showing increased robustness to data-augmentation variations and projection-head architectures. The results suggest that embodied interactions augment SSL by balancing viewpoint invariance with viewpoint sensitivity, enabling more semantic, generalizable object representations with potential applications in data-efficient vision and robotics.

Abstract

Self-supervised learning (SSL) has revolutionized visual representation learning, but has not achieved the robustness of human vision. One reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn it or move around it, and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. For this, we extract the actions performed to change from one ego-centric view of an object to another in four video datasets. We then introduce a new loss function to learn visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This permits the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. In our analysis, we find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.
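
The alignment idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the two visual embeddings are concatenated and linearly projected into a pair representation, that the action is a small vector passed through a linear map, and that the alignment uses an InfoNCE (SimCLR-style) objective with in-batch negatives. The names `info_nce`, `action_alignment_loss`, `w_pair`, and `w_action` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: row i of `queries` should match row i of `keys`.

    All other rows in the batch act as negatives.
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature
    # Subtract the row-wise max for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def action_alignment_loss(z_t, z_tp, actions, w_pair, w_action, temperature=0.1):
    """Align an action embedding with the joint embedding of two views.

    Assumption: the joint representation is a linear projection of the
    concatenated view embeddings; the paper's architecture may differ.
    """
    pair = np.concatenate([z_t, z_tp], axis=1) @ w_pair
    act = actions @ w_action
    return info_nce(pair, act, temperature)


# Toy batch of triplets (x_t, x_t', a_{t,t'}) with random stand-in embeddings.
rng = np.random.default_rng(0)
batch, d_vis, d_act, d_out = 8, 16, 3, 12
z_t = rng.normal(size=(batch, d_vis))      # embedding of view x_t
z_tp = rng.normal(size=(batch, d_vis))     # embedding of view x_t'
actions = rng.normal(size=(batch, d_act))  # action taken between the two views
w_pair = rng.normal(size=(2 * d_vis, d_out))
w_action = rng.normal(size=(d_act, d_out))

loss = action_alignment_loss(z_t, z_tp, actions, w_pair, w_action)
```

In a real training loop the projections would be learned jointly with the visual encoder, so that gradients from the action term shape the visual representation.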

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: A) Example of object interactions from four datasets. B) Summary of the learning process of AA-SSL. See text for details.
  • Figure 2: Scatter plot of categorization accuracy vs. level of invariance in the representation (measured as the cosine similarity of the representations of adjacent video frames) for several methods based on SimCLR. We compute the level of invariance on A) the test set of CO3D, B) the test set of RT4K and C) the train set of RT4K. $t$ indicates the SimCLR temperature applied for learning the action representation and $R$ denotes the maximal rotation angle between an image and its positive pair during training. Horizontal bars show the standard deviation over the 3 seeds, when available. The best performing AA-SSL shows an intermediate level of invariance.
  • Figure 3: PaCMAP (Wang et al., 2021) visualization of all embeddings of test images in category "Chair" in RT4K. Representations are colored according to A) the object instance associated with their image and B) the object yaw orientation. Note how only AA-SimCLR aligns representations of similar views of different objects from the same category.
  • Figure 4: A) Illustration of how we compute our object-wise versus view-wise category generalization metric for a given view of a chair. B) Level of viewpoint invariance versus view-wise category generalization on RT4K after training with methods based on SimCLR.
  • Figure 5: RT4K category accuracy when using reduced sets of data-augmentations during training. AA-SimCLR shows the highest robustness when one or more data-augmentations are removed.
  • ...and 2 more figures
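
The invariance measure named in the Figure 2 caption (cosine similarity between the representations of adjacent video frames) can be illustrated with a short sketch. The function name `adjacent_frame_invariance`, the input layout (one array of frame embeddings per clip), and the choice to average over all adjacent pairs across clips are assumptions made for illustration; the caption does not specify the exact aggregation.

```python
import numpy as np


def adjacent_frame_invariance(clip_embeddings):
    """Mean cosine similarity between embeddings of consecutive frames.

    `clip_embeddings` is a list of arrays, one per clip, each of shape
    (num_frames, dim) with frames in temporal order. Embeddings are
    assumed to be non-zero vectors.
    """
    sims = []
    for emb in clip_embeddings:
        emb = np.asarray(emb, dtype=float)
        if len(emb) < 2:
            continue  # a single frame has no adjacent pair
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        # Row-wise dot product of frame i with frame i + 1.
        sims.append(np.sum(unit[:-1] * unit[1:], axis=1))
    if not sims:
        raise ValueError("need at least one clip with two or more frames")
    return float(np.mean(np.concatenate(sims)))


# Toy example: the first two frames point the same way, the third is orthogonal.
toy_clip = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
score = adjacent_frame_invariance([toy_clip])
```

Values close to 1 indicate that adjacent views are mapped to nearly the same representation (high viewpoint invariance), while lower values indicate that the representation retains viewpoint information.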