Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations
Bhishma Dedhia, Niraj K. Jha
TL;DR
This work addresses the grounding gap in object-centric representations by introducing the Neural Slot Interpreter (NSI), a co-training framework that grounds object concepts in emergent slots via a nested object-centric schema. NSI learns scene and schema representations with a bi-level architecture and a contrastive objective that aligns slot embeddings with schema primitives, enabling flexible, many-to-one grounding beyond traditional one-slot-one-object mappings. Across synthetic and real-world datasets, NSI improves grounding accuracy, object discovery, and downstream few-shot reasoning, outperforming both bounding-box-grounded approaches and plain slot-based methods while remaining data-efficient. The approach demonstrates that grounded, interpretable slot tokens can serve as effective visual substrates for downstream tasks, and it motivates multimodal extensions of object-centric grounding.
Abstract
Several accounts of human cognition posit that our intelligence is rooted in our ability to form abstract composable concepts, ground them in our environment, and reason over these grounded entities. This trifecta of human thought has remained elusive in modern intelligent machines. In this work, we investigate whether slot representations extracted from visual scenes serve as appropriate compositional abstractions for grounding and reasoning. We present the Neural Slot Interpreter (NSI), which learns to ground object semantics in slots. At the core of NSI is a nested schema that uses simple syntax rules to organize the object semantics of a scene into object-centric schema primitives. Then, the NSI metric learns to ground primitives into slots through a structured contrastive learning objective that reasons over the intermodal alignment. Experiments with a bi-modal object-property and scene retrieval task demonstrate the grounding efficacy and interpretability of correspondences learned by NSI. From a scene representation standpoint, we find that emergent NSI slots that move beyond the image grid by binding to spatial objects facilitate improved visual grounding compared to conventional bounding-box-based approaches. From a data efficiency standpoint, we empirically validate that NSI learns more generalizable representations from a fixed amount of annotation data than the traditional approach. We also show that the grounded slots surpass unsupervised slots in real-world object discovery and scale with scene complexity. Finally, we investigate the downstream efficacy of the grounded slots. Vision Transformers trained on grounding-aware NSI tokenizers using as few as ten tokens outperform patch-based tokens on challenging few-shot classification tasks.
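To make the grounding idea concrete, the following is a minimal sketch of a contrastive objective that aligns slots with schema primitives. It is an illustration under stated assumptions, not the authors' implementation: the schema layout in `example_schema`, the function names `scene_schema_score` and `contrastive_loss`, the max-over-slots aggregation, and the `temperature` value are all hypothetical choices made here for clarity. Embeddings are assumed to be precomputed NumPy arrays.

```python
import numpy as np

# Hypothetical nested schema: one scene, organized into object-centric primitives.
# The keys and values are illustrative only and do not come from the paper.
example_schema = {
    "scene": [
        {"object": {"shape": "cube", "color": "red", "position": [0.2, 0.7]}},
        {"object": {"shape": "sphere", "color": "blue", "position": [0.6, 0.3]}},
    ]
}


def scene_schema_score(slots, primitives):
    """Score one scene (K slots, shape (K, D)) against one schema (N primitives, shape (N, D)).

    Assumption: each primitive is matched to its most similar slot. Several
    primitives may select the same slot, which permits many-to-one grounding.
    """
    sim = primitives @ slots.T          # (N, K) primitive-to-slot similarities
    return sim.max(axis=1).sum()        # best slot per primitive, summed over primitives


def _cross_entropy_on_diagonal(logits):
    """Mean negative log-probability of the matched (diagonal) pairs."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def contrastive_loss(batch_slots, batch_primitives, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of (scene, schema) pairs.

    batch_slots[i] and batch_primitives[i] describe the same scene; all other
    pairings in the batch act as negatives.
    """
    batch_size = len(batch_slots)
    scores = np.array([
        [scene_schema_score(batch_slots[i], batch_primitives[j])
         for j in range(batch_size)]
        for i in range(batch_size)
    ]) / temperature
    scene_to_schema = _cross_entropy_on_diagonal(scores)
    schema_to_scene = _cross_entropy_on_diagonal(scores.T)
    return 0.5 * (scene_to_schema + schema_to_scene)
```

In this sketch the loss is low when each scene's slots score highest against their own schema and high when the pairing is shuffled, which is the behavior a contrastive grounding objective is meant to encourage.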
