Are Object-Centric Representations Better At Compositional Generalization?

Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi

TL;DR

This work introduces a Visual Question Answering benchmark across three controlled visual worlds to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties.

Abstract

Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models, together with their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.

Paper Structure (18 sections, 17 figures, 14 tables)

Figures (17)

  • Figure 1: Compositional Generalization. To increase generalization difficulty, we decrease the number of unique object property combinations that are seen during training. In the conceptual example, each object is defined by its shape and size, which coincides with MOVi-C. Datasets. For each generalization difficulty and base dataset, we generate images and corresponding question-answer pairs by sampling objects with the allowed combinations. Training. We pretrain object-centric (OC) models by reconstructing the self-supervised (Dense) features from pretrained vision encoders. For VQA downstream training, we concatenate the image features (OC: red; Dense: blue) with the fixed text embeddings and train transformer models of various sizes to predict the answer given image and question.
  • Figure 2: ID and COOD VQA accuracies are strongly correlated (Pearson and Spearman $>0.9$; $p<0.01$). End-of-training results for CLEVRTex, Super-CLEVR, and MOVi-C (easy, medium, hard) across all image representations (small or large point) and downstream models. Oracle is in the top-right (black), and the question-only baseline is in the bottom-left (gray).
  • Figure 3: Object-centric representations are more compute-efficient. COOD VQA accuracy on Super-CLEVR easy (left), medium (middle), and hard (right) versus downstream compute (log FLOPs) for DINOv2-based models; the question-only baseline is dashed gray. Dense representations slightly outperform OC on easy but do not match OC performance on hard even with $3\times$ compute.
  • Figure 4: Object-centric representations are more sample-efficient. COOD VQA accuracy on MOVi-C easy versus training set size for TF 2 (left) and TF 5 (right). The question-only baseline trained on the full dataset is shown in gray. The OC advantage is strongest at smaller sample sizes and especially with TF 2. The dense representations only catch up to, or slightly surpass, them on the full training dataset of 40k samples with TF 5.
  • Figure 5: OC models generalize better at low training diversity. COOD VQA accuracy for DINOv2 and DINOSAURv2 trained on MOVi-C subsets across sample sizes and diversities (easy to hard) for TF 2 (left) and TF 5 (right). The question-only baseline (full data) is shown in gray. Dense DINOv2 only overtakes DINOSAURv2 at the largest sample size for easier generalizations. Under lower diversity or fewer data points, DINOSAURv2 generalizes better or as well.
  • ...and 12 more figures
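
The split described in the Figure 1 caption (fewer property combinations seen in training means harder generalization) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' data pipeline: the function names `split_combinations` and `sample_scene`, the placeholder property values, and the rule that every individual shape and size must still appear in training are all assumptions made for the example.

```python
import itertools
import random


def split_combinations(shapes, sizes, num_train, seed=0):
    """Split all (shape, size) pairs into train-allowed and held-out sets.

    Assumption for this sketch: every single shape and every single size
    still occurs in training, so only the *combinations* are novel at test.
    """
    all_combos = list(itertools.product(shapes, sizes))
    n_cover = max(len(shapes), len(sizes))
    if not (n_cover <= num_train < len(all_combos)):
        raise ValueError("num_train must cover all values and leave a held-out set")

    # Minimal cover: touches each shape and each size at least once.
    cover = {
        (shapes[i % len(shapes)], sizes[i % len(sizes)]) for i in range(n_cover)
    }

    # Fill the rest of the training budget with randomly chosen combinations.
    rng = random.Random(seed)
    remaining = [c for c in all_combos if c not in cover]
    rng.shuffle(remaining)
    train = cover | set(remaining[: num_train - len(cover)])
    held_out = set(all_combos) - train
    return train, held_out


def sample_scene(allowed, num_objects, rng):
    """Sample a scene as a list of objects drawn only from allowed combinations."""
    pool = sorted(allowed)  # sorted for reproducibility across runs
    return [rng.choice(pool) for _ in range(num_objects)]


# Hypothetical property values; lowering num_train makes the split harder.
SHAPES = ["shape_a", "shape_b", "shape_c"]
SIZES = ["small", "large"]
```

Under these assumptions, in-distribution scenes would be sampled from the first returned set and compositional out-of-distribution (COOD) scenes from the second; reducing `num_train` mimics moving from the easy to the hard setting.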