Object-Centric Pretraining via Target Encoder Bootstrapping

Nikola Đukić, Tim Lebailly, Tinne Tuytelaars

TL;DR

This paper tackles the upper-bound limitation of object-centric learning imposed by frozen target encoders by introducing OCEBO, a self-distillation framework that pretrains object-centric models from scratch with an EMA-updated target encoder. A key contribution is cross-view patch filtering, which selects informative patches for reconstruction to avoid slot collapse during early training. Empirically, OCEBO trained on COCO-scale data achieves competitive unsupervised object discovery performance compared to models pretrained on hundreds of millions of images and demonstrates scalability with larger COCO-based datasets. The work highlights the importance of injecting object-centric inductive biases into the target encoder and points toward scalable object-centric foundation models, while providing code and pretrained models for reproducibility.

Abstract

Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model's encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping (OCEBO), a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at https://github.com/djukicn/ocebo.
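
To make the bootstrapping step concrete, here is a minimal sketch of an exponential moving average (EMA) update in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `ema_update`, the dictionary-of-lists parameter layout, and the default `momentum` value are all hypothetical, and the abstract does not specify the momentum or its schedule.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(target: Params, online: Params, momentum: float = 0.99) -> Params:
    """Update `target` in place as an EMA of `online`.

    Implements target <- momentum * target + (1 - momentum) * online,
    element by element. The momentum default is an assumed value.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if target.keys() != online.keys():
        raise ValueError("target and online must share parameter names")
    for name, online_values in online.items():
        target_values = target[name]
        if len(target_values) != len(online_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        for i, value in enumerate(online_values):
            target_values[i] = momentum * target_values[i] + (1.0 - momentum) * value
    return target


# Hypothetical usage: the target drifts slowly toward the online weights.
target_params = {"w": [0.0, 1.0]}
online_params = {"w": [1.0, 1.0]}
ema_update(target_params, online_params, momentum=0.9)
```

With a momentum close to 1, the target changes slowly, which is the usual reason EMA targets are used in self-distillation: the reconstruction targets stay stable while still tracking the model being trained.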

Paper Structure

This paper contains 15 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of OCEBO: View $x_1$ is processed by the object-centric model's encoder (top branch), producing global and patch representations $\tilde{{\bm{z}}}_1$ and ${\bm{z}}_1$, respectively. Patch representations are sent through the slot attention encoder and decoder, where the latter outputs a reconstruction of the input patch representations $q_1$. Target encoder (bottom branch) processes both views $x_1$ and $x_2$ separately and produces their global and patch representations $\tilde{{\bm{z}}}_{t, 1}$, ${{\bm{z}}}_{t, 1}$, $\tilde{{\bm{z}}}_{t, 2}$ and ${{\bm{z}}}_{t, 2}$, respectively. Patch representations ${\bm{z}}_{t,1}$ and ${\bm{z}}_{t, 2}$ are used by the cross-view patch filtering approach to infer informative target patches and produce the mask ${\bm{m}}$. The inverse augmentation operation ($\texttt{invaug}$) is applied to the target features of $x_2$ and reconstructions of $x_1$ to make them correspond to the overlapping region (purple part of the image) before combining them with the mask ${\bm{m}}$ and applying the object-centric loss $\mathcal{L}_{oc}$. Global loss $\mathcal{L}_{global}$ is applied to global representations $\tilde{{\bm{z}}}_1$ and $\tilde{{\bm{z}}}_{t, 2}$.
  • Figure 2: The percentage of supervised patches, i.e., those that satisfy the cross-view patch filtering condition, over the course of model training. The blue line corresponds to the model trained on COCO, while the orange line corresponds to the model trained on COCO+.
  • Figure 3: PCA visualizations of the representations produced by the target encoder of OCEBO and by DINOv2. RGB values correspond to principal components 1--3 or 4--6.
  • Figure 4: Top: Scaling plots for FG-ARI (left) and mBO (right) with dataset sizes of $2^{15}$, $2^{16}$, $2^{17}$ and $2^{18}$ sampled from COCO or COCO+. Bottom: FG-ARI vs. mBO plot where point sizes indicate the dataset size (the smallest point corresponds to $2^{15}$, while the largest corresponds to $2^{18}$).
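
The Figure 1 caption describes combining reconstructions and target features with the mask ${\bm{m}}$ before applying $\mathcal{L}_{oc}$. The sketch below shows one way such a masked patch loss could look. It rests on assumptions: the captions do not state the form of $\mathcal{L}_{oc}$, so mean squared error is used purely for illustration, and the name `masked_patch_loss` and the list-of-lists layout are hypothetical. The filtering rule that produces the mask is not given here, so the mask is taken as an input.

```python
from typing import List, Sequence


def masked_patch_loss(
    reconstruction: List[List[float]],
    target: List[List[float]],
    mask: Sequence[int],
) -> float:
    """Average a per-patch squared error over patches kept by `mask`.

    Each row is one patch feature vector. MSE is an assumed stand-in
    for the object-centric loss; masked-out patches contribute nothing.
    """
    if not (len(reconstruction) == len(target) == len(mask)):
        raise ValueError("inputs must contain the same number of patches")
    total = 0.0
    kept = 0
    for rec, tgt, keep in zip(reconstruction, target, mask):
        if not keep:
            continue  # patch filtered out: no supervision
        if len(rec) != len(tgt):
            raise ValueError("feature dimensions must match")
        total += sum((r - t) ** 2 for r, t in zip(rec, tgt)) / len(tgt)
        kept += 1
    # With no supervised patches, return zero rather than divide by zero.
    return total / kept if kept else 0.0


# Hypothetical usage: only the first patch is supervised.
loss = masked_patch_loss(
    reconstruction=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [5.0, 5.0]],
    mask=[1, 0],
)
```

Normalizing by the number of kept patches keeps the loss scale comparable as the supervised fraction changes during training, which is the quantity Figure 2 tracks.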