CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping

Tim Lebailly; Thomas Stegmüller; Behzad Bozorgtabar; Jean-Philippe Thiran; Tinne Tuytelaars

TL;DR

CrIBo tackles the challenge that global bootstrapping in self-supervised learning entangles object representations in scene-centric images. It introduces cross-image object-level bootstrapping with semantically coherent region extraction, memory-based cross-image matching, and cycle-consistent positives, all trained with a teacher-student framework and three complementary self-supervised losses. The approach yields strong in-context learning performance and competitive downstream segmentation, while remaining effective on scene-centric data. Overall, CrIBo provides a robust, end-to-end solution for dense representation learning with practical impact on dense retrieval and segmentation tasks.

Abstract

Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.

Paper Structure (51 sections, 17 equations, 6 figures, 12 tables)

Introduction
Related works
Method
Preliminaries
Dense representation.
Local representation.
Object representation.
Global representation.
Bootstrapping
Object-level Cross-image Bootstrapping (CrIBo)
Semantically coherent image regions
Joint-space clustering.
Cross-image object matchings
Cycle-consistent matchings.
Self-supervised training objectives
...and 36 more sections

Figures (6)

Figure 1: Positioning of CrIBo in the landscape of self-supervised learning. a) Illustration of the cross-image self-supervision concept. c) Depiction of object-level self-supervision. b) CrIBo benefits from both learning paradigms.
Figure 2: High-level overview of cross-image object-level bootstrapping (CrIBo). Given an encoder $f$ and a pair of augmented views $\tilde{\boldsymbol{x}}_1$ and $\tilde{\boldsymbol{x}}_2$, object representations $\boldsymbol{c}_i^k$ (depicted as colored object masks) are computed for each view $i$. Using a memory bank, the nearest neighbor of each object representation $\boldsymbol{c}_1^k$ from the first view is retrieved. A self-supervised consistency loss (depicted as colored arrows) is then enforced between $\boldsymbol{c}_2^k$ and the neighbor retrieved for its counterpart in the other view, $\texttt{nn}(\boldsymbol{c}_1^k)$.
Figure 3: Illustration of cycle-consistent matchings. Loosely speaking, such matchings are reciprocal and invariant to data augmentations.
Figure 4: Visualization of cross-image object-level matchings on COCO. For a given query view (considered as view 1 here), object representations are computed via clustering. For each object representation $\boldsymbol{c}_1^k$ (highlighted in a unique color), its nearest neighbor $\texttt{nn}(\boldsymbol{c}_1^k, \mathcal{M}_1)$ in the memory bank $\mathcal{M}_1$ is visualized. In this visualization, $K=12$.
Figure 5: Visualization of cross-image object-level matchings on ImageNet-1k. For a given query view (considered as view 1 here), object representations are computed via clustering. For each object representation $\boldsymbol{c}_1^k$ (highlighted in a unique color), its nearest neighbor $\texttt{nn}(\boldsymbol{c}_1^k, \mathcal{M}_1)$ in the memory bank $\mathcal{M}_1$ is visualized. In this visualization, $K=12$.
...and 1 more figures
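
The retrieval step described in the Figure 2 caption can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`nn`, `cross_image_pairs`, `consistency_loss`), the use of cosine similarity for retrieval, and the `1 - cosine` stand-in loss are all hypothetical choices made here for readability. Object representations are plain Python lists, and object $k$ in view 1 is assumed to correspond to object $k$ in view 2, as the caption's notation suggests. The cycle-consistency filtering of Figure 3 is not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors (assumed retrieval metric)."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def nn(query, memory_bank):
    """Index of the memory-bank entry most similar to `query`."""
    return max(range(len(memory_bank)),
               key=lambda j: cosine(query, memory_bank[j]))


def cross_image_pairs(objects_view1, objects_view2, memory_bank):
    """Pair each view-2 object c_2^k with the neighbor retrieved for c_1^k."""
    pairs = []
    for c1, c2 in zip(objects_view1, objects_view2):
        j = nn(c1, memory_bank)
        pairs.append((c2, memory_bank[j], j))
    return pairs


def consistency_loss(pairs):
    """Stand-in loss: mean (1 - cosine) over pairs. Not the paper's objective."""
    return sum(1.0 - cosine(c2, neighbor) for c2, neighbor, _ in pairs) / len(pairs)


# Toy data: two objects per view, three entries in the memory bank.
objects_view1 = [[1.0, 0.0], [0.0, 1.0]]
objects_view2 = [[0.9, 0.1], [0.1, 0.9]]
memory_bank = [[0.0, 2.0], [3.0, 0.0], [-1.0, -1.0]]

pairs = cross_image_pairs(objects_view1, objects_view2, memory_bank)
retrieved = [j for _, _, j in pairs]
loss = consistency_loss(pairs)
```

The key point the sketch tries to convey is that the query comes from one view while the loss is applied to the other, so the positive pair spans two different images rather than two augmentations of the same one.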
