Learning Object Semantic Similarity with Self-Supervision

Arthur Aubret; Timothy Schaumlöffel; Gemma Roig; Jochen Triesch

TL;DR

This paper asks how semantic relations among objects can emerge from temporal co-occurrences in egocentric vision. It introduces a bio-inspired framework that combines temporal slowness (SSLTT) with visuo-language alignment to learn context-aware object representations from raw visual input and category labels, using a dataset built from MVImageNet. High-level layers cluster objects by context (e.g., kitchen, bathroom), and temporal structure, more than background cues, drives this organization; layer depth and sparsity mediate the balance between context, category, and identity information. The approach offers a plausible computational mechanism for the origins of semantic knowledge in humans and a path toward learning context-sensitive representations from multimodal, temporally structured input.

Abstract

Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a "kitchen" or "eating" context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation de novo from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g., kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans.
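The two alignment objectives described in the abstract can be illustrated with a small sketch. This is not the paper's loss: it assumes an InfoNCE-style contrastive objective with cosine similarity, and the names `info_nce`, `combined_loss`, `temperature`, and `weight` are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not taken from the paper).

    anchors[i] should match positives[i]; every other entry of
    `positives` acts as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def combined_loss(z_now, z_next, label_embeddings, weight=1.0):
    """Temporal alignment plus visuo-language alignment (hypothetical weighting).

    z_now[i] and z_next[i] are embeddings of images seen in succession;
    label_embeddings[i] is the embedding of the category label of image i.
    """
    temporal = info_nce(z_now, z_next)
    visuo_language = info_nce(z_now, label_embeddings)
    return temporal + weight * visuo_language
```

Under these assumptions, the temporal term pulls together representations of images seen in succession, while the second term pulls each image toward the embedding of its category label.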

Paper Structure (20 sections, 2 equations, 7 figures)

Introduction
Related work
Slowly changing visual representations
Impact of visual co-occurrences on object representations
Methods
Temporal Sequences of Egocentric Visual Inputs
Dataset
Creation of Temporal Sequences of Images
Representation Learning
Self-Supervised Learning Through Time (SSLTT)
Simulating Language Guidance
Training and Evaluation
Training
Evaluation
Experiments
...and 5 more sections

Figures (7)

Figure 1: We simulate extended egocentric visual experience of objects in different contexts. Objects from the same context have a high probability of being seen in succession (intra-context transition), while this probability is reduced for objects from different contexts (inter-context transition).
Figure 2: Learning architecture. See text for details.
Figure 3: Differently trained models evaluated with odd-one-out test accuracy on randomly assigned context labels. The models utilize either one of the two losses or both at the same time. "SSLTT-VLA*" denotes SSLTT-VLA trained and tested with random context assignments. The labels refer to the evaluated layer of the network.
Figure 4: Visualization of the image representations using random images from the test dataset on the second layer of $h_1$. The colors represent the eight different contexts. The left plot shows the model trained using an image sequence with $p_c=0.1$, whereas the right plot uses $p_c=1.0$.
Figure 5: Odd-one-out test accuracy for context, category and object instance labels measured at different levels of the network.
...and 2 more figures
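
The captions above mention two mechanisms that a short sketch can make concrete: sequences in which objects from the same context tend to follow one another (Figure 1), and the odd-one-out test (Figures 3 and 5). The code below is a minimal illustration under stated assumptions, not the authors' implementation. The parameter `p_switch` is hypothetical and its relation to the $p_c$ of Figure 4 is not specified here; using cosine similarity for the odd-one-out decision is also an assumption.

```python
import math
import random

def sample_context_sequence(objects_by_context, length, p_switch, seed=0):
    """Return a list of (context, object) pairs.

    At each step after the first, the sequence moves to a different
    context with probability `p_switch` (hypothetical parameter) and
    otherwise stays in the current context.
    """
    rng = random.Random(seed)
    contexts = sorted(objects_by_context)
    current = rng.choice(contexts)
    sequence = []
    for _ in range(length):
        if sequence and len(contexts) > 1 and rng.random() < p_switch:
            current = rng.choice([c for c in contexts if c != current])
        sequence.append((current, rng.choice(objects_by_context[current])))
    return sequence

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(a, b, c):
    """Index (0, 1, or 2) of the item left out of the most similar pair."""
    pair_similarity = {2: cosine(a, b), 1: cosine(a, c), 0: cosine(b, c)}
    return max(pair_similarity, key=pair_similarity.get)
```

For example, with a low `p_switch` most consecutive pairs share a context, which is the kind of statistic a temporal alignment objective could exploit. An odd-one-out trial counts as correct when the returned index points at the item whose label differs from the other two.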
