Learning to See the Elephant in the Room: Self-Supervised Context Reasoning in Humans and AI

Xiao Liu, Soumick Sarker, Ankur Sikarwar, Bryan Atista Kiely, Gabriel Kreiman, Zenglin Shi, Mengmi Zhang

TL;DR

SeCo (Self-supervised learning for Context Reasoning) is a biologically inspired model that learns contextual relationships from complex scenes. It outperforms state-of-the-art self-supervised learning approaches and predicts object placements most consistent with human behaviour, highlighting the central role of contextual associations in scene understanding.

Abstract

Humans rarely perceive objects in isolation but interpret scenes through relationships among co-occurring elements. How such contextual knowledge is acquired without explicit supervision remains unclear. Here we combine human psychophysics experiments with computational modelling to study the emergence of contextual reasoning. Participants were exposed to novel objects embedded in naturalistic scenes that followed predefined contextual rules capturing global context, local context and crowding. After viewing short training videos, participants completed a "lift-the-flap" task in which a hidden object had to be inferred from the surrounding context under variations in size, resolution and spatial arrangement. Humans rapidly learned these contextual associations without labels or feedback and generalised robustly across contextual changes. We then introduce SeCo (Self-supervised learning for Context Reasoning), a biologically inspired model that learns contextual relationships from complex scenes. SeCo encodes targets and context with separate vision encoders and stores latent contextual priors in a learnable external memory module. Given contextual cues, the model retrieves likely object representations to infer hidden targets. SeCo outperforms state-of-the-art self-supervised learning approaches and predicts object placements most consistent with human behaviour, highlighting the central role of contextual associations in scene understanding.

Paper Structure (25 sections, 2 equations, 17 figures, 3 algorithms)

Figures (17)

  • Figure 1: Humans learn contextual rules from complex natural scenes without explicit instructions or feedback. A. Humans perceive scenes holistically rather than as isolated objects. Through exposure to rich, multi-object environments, they implicitly learn contextual associations without the need for explicit instruction or supervision. B. Foveated vision, with high resolution at the center of gaze and lower resolution in the periphery, supports object-centric representations of complex scenes. When fixating on a table (red), surrounding objects are organized into a table-centered scene graph [khandelwal2023adaptive]. Bounding box colors denote contextual associations: local co-occurrence (e.g., mug on table), global context (e.g., stove in kitchen), and crowding (e.g., multiple chairs near the table). C. To systematically study contextual reasoning in humans and AI, we introduce two evaluation tasks: lift-the-flap and object priming. In lift-the-flap (left), agents use scene context to infer the identity of a hidden object behind a black patch framed in red. In object priming (right), given a scene and a target object that is not already present, agents predict contextually appropriate locations for placing the object. D. The FRibble In the sceNE (FRINE) dataset for studying how humans learn to reason from context. So that participants cannot rely on prior contextual knowledge of familiar objects, we construct the FRINE dataset using novel objects. We begin by selecting four novel object families (Fa1, Fb1, Fb3, Fc1) from the Novel Object Dataset (NOD) [singh2023learning]. These “fribbles” have distinct body structures and appendages that are unfamiliar to humans, as shown in D1. Next, we replace eight common household object classes, which serve as anchor objects in the Unity-based indoor scene simulator VirtualHome [puig2018virtualhome], with fribbles. D2 shows an example VirtualHome apartment, alongside example images of the eight household objects in D3, each color-coded according to the global, local, and crowding associations defined in B. The mapping between anchor objects and fribbles defines the three novel contextual rules, as illustrated in D4. Each column represents a contextual rule, and each row shows the fribble assigned to the corresponding anchor object class within that rule. See Methods for details on the FRINE dataset.
  • Figure 2: We introduce human psychophysics experiments on learning to reason from context with the FRINE dataset. A. Schematic of the Human Psychophysics Experiments. The experiment comprises two phases: a training phase and a testing phase. In the training phase, participants were shown a sequence of 40 training video clips from the FRINE dataset. Each clip lasted 10 to 20 seconds and depicted a novel fribble object centered within a naturalistic scene. The camera rotated around the fribble object in each video. See B for an example training clip. Participants were assigned to either a supervised (SUP) or self-supervised learning (SSL) condition. In the SUP condition (clips framed in black), the fribble object was highlighted with a red bounding box and labeled (e.g., “fb1”). In the SSL condition (clips framed in gray), no labels were shown. Importantly, each participant viewed training videos from only one learning condition (either SUP or SSL) throughout the entire training phase. In B, the first three frames are shown in the SUP condition and the remaining frames in SSL for illustration only. Frame timestamps are shown above each frame, and a scale bar on the right indicates the frame size in degrees of visual angle. During the test phase, participants viewed 40 test clips from the FRINE dataset. In each clip, the fribble object was occluded by a black patch, and participants were required to infer its identity based solely on contextual information, making a 4-way forced-choice classification. C. Context Manipulations in the Lift-the-Flap Task. To investigate the role of context, we introduced four context variations in the test clips: normal context, blur context, context areas, and jigsaw context. D. Example Test Clip. We show an example test clip under the normal context condition, where the central fribble object is hidden beneath a black patch while the surrounding context remains intact. Frame timestamps and video frame sizes (in degrees of visual angle) are annotated as in B. E. Individual subject performance differences in the lift-the-flap task. Violin plots show the distribution of top-1 accuracy across individual participants under SUP (green) and SSL (orange) learning conditions, with each black dot representing a single subject. The box spans the interquartile range (25th–75th percentiles); whiskers denote the full range of the data. F. Response time across human subjects for SUP and SSL conditions. The plot shows the proportion of SUP (black) and SSL (dark grey) participants as a function of their mean reaction time per trial (in seconds). Vertical dotted lines indicate the median response time for each group. G. Reaction times for correct and incorrect trials in SUP and SSL humans. Trials were separated into correct and incorrect responses for SUP and SSL participants, and their average reaction time was computed for each learning condition. Error bars indicate the standard error of the mean (SEM). “n.s.” indicates no statistically significant difference between the two distributions ($p > 0.05$).
  • Figure 3:
  • Figure 4: We propose SeCo, a self-supervised method with external memories for context reasoning. A. Model schematic of SeCo. The architecture consists of three components: a target discovery module, a two-stream visual processor (trapezoids), and an external memory module (orange squircle). During pre-training, given a full image $I_f$, SeCo employs an unsupervised method (selective search) to generate potential object proposals. These proposals are then converted into multiple context–target image pairs $(I_c, I_t)$. For each pair, SeCo processes the context and target separately using non-shared encoders ($E_c$, $E_t$) followed by non-shared projectors ($P_c$, $P_t$). In parallel, SeCo incorporates a trainable external memory to store contextual priors. Ideally, if $E_c$ encodes strong contextual cues, its latent representation should serve as an effective query to retrieve a relevant object representation from the external memory based on learned keys $P_k$ and values. A joint training objective is applied: the mean squared error loss $L_{\text{mse}}$ encourages alignment between the embedding retrieved from the memory (based on context) and the target object embedding, while additional regularization losses (the covariance loss $L_{\text{cov}}$ and the variance loss $L_{\text{var}}$) promote diversity and prevent collapse of the learned representations (shown as stacked rectangles). B. During fine-tuning, SeCo is adapted to the downstream lift-the-flap task. Given a test image with a hidden object, the pre-trained frozen context encoder $E_c$ is used to extract contextual representations, followed by a fully connected layer to predict the label of the hidden object (e.g., "keyboard") behind the black patch via linear probing. C. Datasets used for pre-training, fine-tuning, and testing AI models. C1. The COCO-OCD dataset contains naturalistic images from COCO-Stuff [cocostuff] overlapping with object classes in the OCD dataset [whenpigsfly]. C2. The COCO-VOC dataset contains COCO-Stuff images overlapping with object classes in PASCAL-VOC07 [voc07]. For each dataset, in-domain and out-of-domain test sets are provided, with target objects listed below each context image. D. Top-1 accuracy of human participants and AI models on the lift-the-flap task under normal context conditions using the FRINE dataset. From left to right: SUP humans (black), SSL humans (gray), SeCo (red), self-supervised learning (SSL) baselines including ORL [xie2021unsupervised], SimSiam [simsiam], VICReg [vicreg], DINO [dino], SimCLR [simclr], and Context Encoder [contextencoder] (orange), and a supervised learning baseline (green). A total of 517 SUP human trials, 548 SSL human trials, and 1,926 trials per model were collected. Error bars indicate standard errors computed across all trials for each group. E. Top-1 accuracy of human participants and SeCo under normal context conditions for Rules 1–3 of the FRINE dataset. Number of trials per rule is indicated in brackets: Rule 1: SUP humans (165), SSL humans (186); Rule 2: SUP humans (180), SSL humans (187); Rule 3: SUP humans (172), SSL humans (175). SeCo was evaluated on 1,926 trials for each rule. F. Top-1 accuracy under normal context conditions for SUP humans, SSL humans, and SeCo across different context associations in the FRINE dataset. For each association type, the number of trials exceeds 640 for SUP humans, SSL humans, and SeCo. See Methods for definitions of the three context association types. Across D, E, and F, the chance level (25%) is indicated by a horizontal black dashed line. Error bars indicate the standard error of the mean (SEM). * denotes performance significantly above chance based on Welch's two-tailed t-test ($p < 0.05$).
  • Figure 5:
  • ...and 12 more figures
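
The pre-training objective described in the Figure 4 caption (a context embedding queries an external memory, and the retrieved embedding is aligned with the target embedding under variance and covariance regularization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The caption names only the three losses, so several details here are assumptions: the softmax-attention form of the memory read, the VICReg-style formulas for the variance and covariance terms, the equal loss weights, and every function and parameter name. The encoders and projectors are omitted, and random vectors stand in for their outputs.

```python
import numpy as np


def retrieve_from_memory(query, keys, values):
    """Soft read from an external memory (assumed attention form).

    query: (batch, d) context embeddings; keys, values: (slots, d).
    """
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)                   # (batch, slots)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values                                # (batch, d)


def variance_loss(z, target_std=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation (VICReg-style assumption)."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


def covariance_loss(z):
    """Sum of squared off-diagonal covariances, scaled by dimension (assumption)."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)


def seco_style_loss(context_emb, target_emb, keys, values,
                    w_mse=1.0, w_var=1.0, w_cov=1.0):
    """Joint objective: MSE alignment plus variance and covariance regularizers.

    The weights w_* are hypothetical; the source does not state them.
    """
    retrieved = retrieve_from_memory(context_emb, keys, values)
    l_mse = float(np.mean((retrieved - target_emb) ** 2))
    l_var = variance_loss(retrieved) + variance_loss(target_emb)
    l_cov = covariance_loss(retrieved) + covariance_loss(target_emb)
    return w_mse * l_mse + w_var * l_var + w_cov * l_cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, slots, dim = 8, 16, 32          # illustrative sizes only
    context_emb = rng.normal(size=(batch, dim))
    target_emb = rng.normal(size=(batch, dim))
    keys = rng.normal(size=(slots, dim))
    values = rng.normal(size=(slots, dim))
    print(seco_style_loss(context_emb, target_emb, keys, values))
```

In a trainable version, the keys, the values, and both encoder and projector stacks would be optimized by gradient descent on this objective; an automatic-differentiation framework would replace the plain NumPy arrays used here.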