Evaluating Object-Centric Models beyond Object Discovery
Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
TL;DR
This work addresses a gap in object-centric learning (OCL): evaluation has rarely gone beyond object discovery. It uses instruction-tuned vision-language models (VLMs) as zero-shot evaluators to assess how well OCL representations support visual reasoning across VQA benchmarks, and it introduces a unified attribution-aware metric (AwGA) and an enhanced grounding dataset (eGQA) to measure localization and representational usefulness jointly. The experiments show that multi-feature reconstruction (mFRESA) improves downstream utility and that object-discovery metrics are poor predictors of VQA performance, which argues for joint evaluation. The findings suggest that current OCL methods are competitive with strong baselines on some perception tasks but lag in compositional and robust reasoning, pointing to the need for stronger object-centric representations and more reliable evaluation frameworks.
Abstract
Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.
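To make the idea of jointly scoring "where" and "what" concrete, the following is a minimal sketch of one possible joint score. It is not the paper's metric: the text above does not define the unified metric, so the `Sample` fields, the exact-match answer check, the box-IoU grounding check, and the 0.5 threshold are all illustrative assumptions. A sample counts as correct only if the answer is right and the predicted region overlaps the ground-truth region.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format, assumed for illustration: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Sample:
    # All field names are hypothetical; they are not taken from the paper.
    predicted_answer: str
    true_answer: str
    predicted_box: Box
    true_box: Box


def answer_accuracy(samples: List[Sample]) -> float:
    """'What' only: fraction of samples with an exactly matching answer."""
    if not samples:
        return 0.0
    hits = sum(
        s.predicted_answer.strip().lower() == s.true_answer.strip().lower()
        for s in samples
    )
    return hits / len(samples)


def joint_accuracy(samples: List[Sample], iou_threshold: float = 0.5) -> float:
    """'What' and 'where' together: the answer must match AND the box must
    overlap the ground truth by at least iou_threshold (an assumed value)."""
    if not samples:
        return 0.0
    hits = sum(
        s.predicted_answer.strip().lower() == s.true_answer.strip().lower()
        and iou(s.predicted_box, s.true_box) >= iou_threshold
        for s in samples
    )
    return hits / len(samples)
```

Under this sketch, a model that answers correctly while attending to the wrong region is credited by `answer_accuracy` but not by `joint_accuracy`, which is the kind of inconsistency that disjoint metrics cannot surface.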
