Evaluating Object-Centric Models beyond Object Discovery

Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

TL;DR

This work targets a core gap in object-centric learning by proposing scalable, downstream-focused evaluation beyond object discovery. It leverages instruction-tuned vision-language models as zero-shot evaluators to assess how well OCL representations support broad visual reasoning across VQA benchmarks, introducing a unified attribution-aware metric (AwGA) and an enhanced grounding dataset (eGQA) to jointly measure localization and representational usefulness. Through extensive experiments, the authors show that multi-feature reconstruction (mFRESA) improves downstream utility and that object-discovery metrics poorly predict VQA performance, emphasizing the need for joint evaluation. The findings suggest that current OCL methods are competitive with strong baselines in some perception tasks but lag in compositional and robust reasoning, highlighting directions for building stronger object-centric representations and more reliable evaluation frameworks with practical impact for multimodal systems.

Abstract

Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) they provide limited insight into the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.

Paper Structure (19 sections, 3 equations, 8 figures, 11 tables)

This paper contains 19 sections, 3 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Issues with disjoint evaluation. Disjoint metrics ignore localization and representation fragmentation. (top) Models M1 and M2 obtain the same classification score despite M1 localizing the object more accurately (localization fragmentation). (bottom) Transformer-based probing for VQA tasks does not attribute answers to specific slots; hence, model M4 (correct answer from correct slot) is scored the same as M3, which answers correctly using the wrong slot (representation fragmentation).
  • Figure 2: Training and evaluation setup. Our training is akin to LLaVA (Liu et al., 2023). In Stage I, only the MLP connector is trained on the pre-training dataset. This aligns the slot embeddings with the language model's embedding space. In Stage II, the MLP network and the language model are trained on the instruction-tuning dataset from LLaVA. This enables the language model to follow instructions and perform tasks based on slots as visual tokens. Evaluation is performed in a zero-shot fashion on various VQA benchmarks, where the text is encoded via a text encoder, and images are encoded using the slot-attention model. The evaluation tests how well the connector and LLM networks utilize the slot embeddings for answering the provided questions.
  • Figure 3: Metrics for evaluating OCL models. Accuracy and mIoU evaluate representation usefulness and localization separately, while grounded accuracy allows a joint evaluation that addresses localization fragmentation but overlooks representation fragmentation. Our proposed AwGA jointly evaluates both and penalizes both fragmentation types.
  • Figure 4: Qualitative examples. AwGA scores each example using the image, grounding masks (G. Mask), and the question–answer pair. The need for AwGA is evident: DINOSAURv2 achieves high G-Acc (correct answer and high mIoU) but low AwGA, since the Top-$K$ answer-attributed slots poorly overlap with the grounded mask. Similarly, StableLSD predicts the correct answer but has a low AwGA due to weak grounding overlap.
  • Figure 5: Robustness of AwGA. Spearman's rank correlation of the AwGA metric across different LLM and connector designs. AwGA remains stable across different LLMs and connector architectures, suggesting that our VLM-based evaluation is relatively insensitive to the specific LLM or connector architecture.
  • ...and 3 more figures
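
The distinction drawn in the captions for Figures 1, 3, and 4 can be illustrated with a toy sketch. This is not the paper's definition of grounded accuracy or AwGA, which the captions do not spell out. It assumes, purely for illustration, that a grounded hit requires a correct answer plus some slot mask with IoU above a threshold, and that an attribution-aware hit additionally requires the union of the Top-K answer-attributed slots to overlap the grounding mask. The function names, the 0.5 threshold, the attribution scores, and the set-of-pixels mask format are all hypothetical.

```python
from typing import List, Set

Mask = Set[int]  # hypothetical mask format: a set of flattened pixel indices


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def grounded_hit(answer_correct: bool, slot_masks: List[Mask],
                 grounding: Mask, iou_thr: float = 0.5) -> bool:
    """Assumed rule: the answer is correct AND some slot localizes the region."""
    best = max((iou(m, grounding) for m in slot_masks), default=0.0)
    return answer_correct and best >= iou_thr


def attribution_aware_hit(answer_correct: bool, slot_masks: List[Mask],
                          attributions: List[float], grounding: Mask,
                          k: int = 1, iou_thr: float = 0.5) -> bool:
    """Assumed rule: the union of the top-k attributed slots must overlap the grounding mask."""
    ranked = sorted(range(len(slot_masks)),
                    key=lambda i: attributions[i], reverse=True)
    attributed: Mask = set()
    for i in ranked[:k]:
        attributed |= slot_masks[i]
    return answer_correct and iou(attributed, grounding) >= iou_thr


# Toy scene: slot 0 covers the queried object, slot 1 covers background.
grounding = {0, 1, 2, 3}
slots = [{0, 1, 2, 3}, {4, 5, 6, 7}]

# Both models answer correctly; M3 relies on the wrong slot, M4 on the right one.
m3_attr = [0.1, 0.9]
m4_attr = [0.9, 0.1]

m3_grounded = grounded_hit(True, slots, grounding)
m4_grounded = grounded_hit(True, slots, grounding)
m3_aware = attribution_aware_hit(True, slots, m3_attr, grounding)
m4_aware = attribution_aware_hit(True, slots, m4_attr, grounding)

print(m3_grounded, m4_grounded)  # the grounded rule cannot tell M3 and M4 apart
print(m3_aware, m4_aware)        # the attribution-aware rule separates them
```

Under these assumptions, the grounded rule scores M3 and M4 identically because it never asks which slot produced the answer, whereas the attribution-aware rule rejects M3. This mirrors the representation fragmentation issue described for Figure 1.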