
Zero-Shot Object-Centric Representation Learning

Aniket Didolkar, Andrii Zadaianchuk, Anirudh Goyal, Mike Mozer, Yoshua Bengio, Georg Martius, Maximilian Seitzer

TL;DR

This work assesses zero-shot generalization in object-centric representation learning by introducing a benchmark of eight diverse datasets and analyzing how model choices and training data properties influence transfer. It finds that training on complex natural data substantially boosts zero-shot performance, and that fine-tuning pre-trained encoders for object discovery, coupled with high-resolution adaptation and efficient decoding, yields state-of-the-art unsupervised object discovery with strong zero-shot transfer. The proposed FT-Dinosaur approach improves results on both in-distribution and out-of-distribution datasets, and its zero-shot transfer matches or surpasses in-distribution training in several cases. Overall, the paper highlights the importance of task-specific encoder adaptation and realistic training data for building generalizable object-centric models, and points to scaling data and models as directions for future research.

Abstract

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

Paper Structure

This paper contains 57 sections, 3 equations, 18 figures, and 8 tables.

Figures (18)

  • Figure 1: Evaluating zero-shot transfer of object-centric representations. Performance is given in FG-ARI; see the appendix for corresponding plots with mBO. (a) Performance of current object-centric models trained on the Coco dataset. (b) Performance of the Dinosaur method (Seitzer et al., 2023) with different training datasets. (c) Scaling behavior of Dinosaur when training on differently sized subsets of Coco.
  • Figure 2: Overview of our method "FT-Dinosaur". ➀ Object-Centric Finetuning: starting from Dinov2, the encoder is finetuned for the task of object discovery on the Coco dataset. ➁ High-Res Adaptation: the model is further adapted to high-resolution images. ➂ Zero-Shot Transfer: at test time, we apply the trained model to the 8 datasets of our proposed zero-shot benchmark.
  • Figure 3: Visualization of encoder features in Dinosaur (frozen Dinov2 features) and of features adapted with object-centric finetuning. We show the 1st to 3rd PCA components visualized by different RGB channels (second column). The last column shows the scene decomposition masks produced by each method. More examples and additional PCA components are shown in the appendix.
  • Figure 4: Normalized performance when adding finetuning to Dinosaur for in-distribution training, using a ViT-S/14 Dinov2 encoder. Finetuning shows strong gains on all datasets. Numerical results are given in the appendix.
  • Figure 5: Comparing in-distribution training vs. zero-shot transfer from Coco for our finetuning approach. Overall, performance is similar. Numerical results are given in the appendix.
  • ...and 13 more figures
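
The feature visualization described in the Figure 3 caption (first three PCA components mapped to RGB channels) can be sketched roughly as below. This is an illustrative sketch, not the authors' code: the function name `pca_rgb`, the 16x16 patch grid, the 384-dimensional features, and the random stand-in features are all assumptions made for the example. In practice the input would be the patch features of a real encoder.

```python
import numpy as np


def pca_rgb(features, grid_hw):
    """Map patch features of shape (num_patches, dim) to an (H, W, 3) image.

    Each of the first three principal components is rescaled to [0, 1]
    and used as one RGB channel. Hypothetical helper for illustration.
    """
    h, w = grid_hw
    assert features.shape[0] == h * w, "one feature vector per patch expected"

    # Center the features; PCA directions are the right singular vectors.
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Project onto the top-3 principal directions: (num_patches, 3).
    projected = centered @ vt[:3].T

    # Rescale each component independently to the [0, 1] range.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-8)
    return rgb.reshape(h, w, 3)


# Stand-in for encoder output: random features on an assumed 16x16 patch grid.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16 * 16, 384))
img = pca_rgb(feats, (16, 16))
```

The resulting array can be passed to any image viewer, for instance `matplotlib.pyplot.imshow(img)`. Rescaling each component separately keeps all three channels visible, at the cost of hiding their relative variance.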