Vector-Quantized Vision Foundation Models for Object-Centric Learning

Rongzhen Zhao; Vivienne Wang; Juho Kannala; Joni Pajarinen

TL;DR

The paper addresses a known weakness of object-centric learning (OCL), namely segmenting objects with complex textures, with a unified architecture: Vector-Quantized Vision Foundation Models for OCL (VVO). VVO quantizes Vision Foundation Model (VFM) features and uses the quantized features, derived from the same VFM features that feed aggregation, as shared reconstruction targets; this lets VFM features be used directly for slot formation and benefits a range of decoders (mixture, auto-regressive, diffusion). Across synthetic and real datasets, VVO consistently improves object-discovery metrics and downstream prediction and reasoning on Physion, and ablations show the advantages of shared VFM encoding and the proposed Q-quantization tricks. The work offers both a mathematical analysis and practical gains, and the authors release code and checkpoints as a general testbed for integrating VFMs with OCL.

Abstract

Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed \textit{slots}. Its self-supervision, which reconstructs the input from slots, struggles with complex object textures; thus, Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply the shared quantization of VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available at https://github.com/Genera1Z/VQ-VFM-OCL.
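To make the notion of quantizing VFM representations concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization, not the authors' implementation. The function name `quantize`, the toy feature values, and the two-entry codebook are all hypothetical; in practice the features would come from a VFM and the codebook would be learned.

```python
# Illustrative sketch only: generic vector quantization, not the paper's code.
# All names, shapes, and values below are assumptions made for this example.

def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry (squared L2)."""
    indices = []
    for f in features:
        # Squared Euclidean distance from this feature to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [codebook[i] for i in indices]
    return quantized, indices


# Toy stand-in for a VFM feature map flattened to (positions, channels).
vfm_features = [[0.9, 0.1], [0.1, 0.8], [1.1, -0.1], [0.0, 1.2]]

# Hypothetical codebook with two codes (a real one would be learned).
codebook = [[1.0, 0.0], [0.0, 1.0]]

# The quantized features are computed from the same VFM features; here they
# play the role of the discretized reconstruction target.
quantized, indices = quantize(vfm_features, codebook)
print(indices)
```

Each position is assigned the index of its closest code, so nearby features collapse onto the same discrete target. Under the reading of the abstract assumed here, that discretized target is derived from the same VFM features used for aggregation.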
