Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations

Bhishma Dedhia, Niraj K. Jha

TL;DR

This work addresses the grounding gap in object-centric representations by introducing the Neural Slot Interpreter (NSI), a co-training framework that grounds object concepts into emergent slots via a nested object-centric schema. NSI learns scene and schema representations with a bi-level architecture and a contrastive objective that aligns slot embeddings with schema primitives, enabling flexible, many-to-one grounding beyond traditional one-slot-one-object mappings. Across synthetic and real-world datasets, NSI improves grounding accuracy, object discovery, and downstream few-shot reasoning, outperforming both bounding-box-grounded approaches and plain slot-based methods while remaining data-efficient. The approach demonstrates that grounded, interpretable slot tokens can serve as effective visual substrates for downstream tasks and motivates multimodal extensions of object-centric grounding.

Abstract

Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is a nested schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured contrastive learning objective that reasons over the intermodal alignment. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots that move beyond the image grid by binding to spatial objects facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach. We also show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Finally, we investigate the downstream efficacy of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.

Paper Structure

This paper contains 52 sections, 17 equations, 41 figures, 8 tables, and 2 algorithms.

Figures (41)

  • Figure 1: NSI abstracts grounded slots from scenes and enhances object discovery, grounding efficacy, and downstream reasoning abilities of slot representations.
  • Figure 2: Description of a real-world scene using the nested schema. The dotted arrows show correspondences between primitives and the objects they annotate.
  • Figure 3: NSI overview. NSI augments object-centric learning autoencoders with a contrastive learning objective over a batch of scene-schema pairs. A DINOSAUR backbone (Seitzer et al., 2023) extracts slot representations $\mathcal{S}_x^{1:K}$ from a batch of scenes and a schema encoder extracts neural primitives $\mathcal{Z}_y^{1:N}$ from their corresponding schema pair. The slots are then passed to a decoder for reconstruction and the slot-primitive neural pairs are passed to the contrastive learning objective.
  • Figure 4: NSI method. (a) A DINOSAUR encoder (Seitzer et al., 2023) learns to represent images via slots. (b) A bi-level schema encoder learns a representation of schema primitives. The primitive encoder embeds the object properties of each schema primitive. Then, a Transformer learns embeddings that assimilate the entire schema context. (c) The inner loop of the metric computes the score $S_{xy}$ between compositional abstractions of an image $I_x$ and a schema $P_y$. Object slots and schema primitives are projected onto a shared embedding space and every latent primitive is assigned to its nearest slot for score aggregation. (d) The $S_{xy}$ scores obtained from local entities are used to optimize a global contrastive learning objective in the outer loop over a batch of image-schema pairs.
  • Figure 5: Retrieval results. (a), (b), (c) Property and scene retrieval results. We report Recall$@1/5$ (higher is better). The standard deviation (over five seeds) was $<0.3$ across all model instances and retrieval tasks. The lighter shade shows Recall@5 while the darker shade shows Recall@1. (d), (e), (f) Visualization of correspondences learned by the NSI similarity metric. The colored arrows show the respective correspondences of schema primitives to the slots. Each schema instance is chunked and color-coded by the slot to which its primitives are assigned.
  • ...and 36 more figures
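
The two-level metric described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the caption says only that each primitive is assigned to its nearest slot and that the resulting $S_{xy}$ scores feed a contrastive objective over a batch. The sketch assumes cosine similarity as the nearness measure, mean aggregation over primitives, a symmetric InfoNCE-style loss, and a temperature of 0.1. All function names are hypothetical, and the toy embeddings stand in for slots and primitives that have already been projected into the shared space.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def nearest_slots(slots, primitives):
    """For each primitive, return (index of nearest slot, cosine similarity).
    Assumption: 'nearest' means highest cosine similarity."""
    unit_slots = [normalize(s) for s in slots]
    pairs = []
    for z in primitives:
        sims = [dot(normalize(z), s) for s in unit_slots]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((best, sims[best]))
    return pairs

def score(slots, primitives):
    """Inner loop: aggregate nearest-slot similarities into one score S_xy.
    Assumption: aggregation is a mean over the schema's primitives."""
    pairs = nearest_slots(slots, primitives)
    return sum(sim for _, sim in pairs) / len(pairs)

def log_softmax_at(row, idx):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return row[idx] - lse

def contrastive_loss(S, temperature=0.1):
    """Outer loop: symmetric InfoNCE-style loss over a batch score matrix.
    Matched image-schema pairs are assumed to lie on the diagonal of S."""
    n = len(S)
    scaled = [[v / temperature for v in row] for row in S]
    cols = [list(c) for c in zip(*scaled)]
    image_to_schema = -sum(log_softmax_at(scaled[i], i) for i in range(n)) / n
    schema_to_image = -sum(log_softmax_at(cols[i], i) for i in range(n)) / n
    return 0.5 * (image_to_schema + schema_to_image)

# Toy batch of two scenes with two slots each (2-D embeddings for readability).
batch_slots = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
# Schema 0 has three primitives, two of which fall on the same slot;
# schema 1 has a single primitive.
batch_primitives = [
    [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]],
    [[-1.0, -0.1]],
]

assignment = [idx for idx, _ in nearest_slots(batch_slots[0], batch_primitives[0])]
S = [[score(slots, prims) for prims in batch_primitives] for slots in batch_slots]
loss = contrastive_loss(S)
```

In this toy batch, the first and third primitives of schema 0 are both assigned to slot 0, which illustrates the many-to-one grounding mentioned in the TL;DR. Because the assignment is computed per primitive rather than by one-to-one matching, the number of primitives in a schema need not equal the number of slots.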