Grouped Discrete Representation for Object-Centric Learning

Rongzhen Zhao; Vivienne Wang; Juho Kannala; Joni Pajarinen

TL;DR

The paper tackles the limitation of scalar-discrete representations in Object-Centric Learning (OCL) by introducing Grouped Discrete Representation (GDR), which decomposes features into $g$ attribute groups and uses tuple indexes to discretize them, enabling attribute-level similarities to guide learning. It further introduces an invertible channel projection mechanism using a learnable $W$ and its pseudo-inverse to organize channels for grouping, plus a residual pathway with annealing to preserve information. GDR is designed to be compatible with both Transformer-based and Diffusion-based OCL frameworks and demonstrates improved convergence and generalization across diverse image and video benchmarks, with enhanced object separability and interpretability. Ablation studies reveal that a moderate number of groups and an adequate channel expansion rate, together with the invertible projection and training tricks (residual, annealing, normalization), are key to maximizing gains. The findings suggest GDR is a practical, extensible enhancement for VAE-based OCL that can generalize to other VAE-guided tasks and improve attribute-level representations without requiring external supervision.
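The invertible channel projection can be illustrated with a toy, dependency-free sketch. Everything concrete here is an assumption made for illustration: the 2-to-4 channel expansion, the fixed values of `W` (learnable in the paper), and the helper names `matmul`, `transpose`, and `pinv_two_rows`. The quantization step that would sit between the two projections is left out, so the round trip is exact in this sketch.

```python
# Toy sketch: expand 2 channels to 4 with W, then map back with pinv(W).
# Shapes, values, and helper names are illustrative assumptions, not the paper's code.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def pinv_two_rows(w):
    """Pseudo-inverse of a full-row-rank 2 x d matrix: W^T (W W^T)^-1."""
    wt = transpose(w)
    (a, b), (c, d) = matmul(w, wt)            # 2 x 2 Gram matrix W W^T
    det = a * d - b * c
    gram_inv = [[d / det, -b / det], [-c / det, a / det]]
    return matmul(wt, gram_inv)               # d x 2

# Stand-in for the learnable projection (2 input channels, 4 expanded channels).
W = [[1.0, 0.0, 1.0, 2.0],
     [0.0, 1.0, -1.0, 1.0]]
W_pinv = pinv_two_rows(W)

features = [[0.5, -2.0], [3.0, 1.0]]          # two feature vectors, 2 channels each
expanded = matmul(features, W)                # 2 x 4: channels that would be grouped
restored = matmul(expanded, W_pinv)           # 2 x 2: projected back with pinv(W)
```

Because `W` has full row rank, `W` times its pseudo-inverse is the identity, so `restored` matches `features` up to floating-point error. With quantization inserted between the two projections, the round trip would only be approximate.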

Abstract

Object-Centric Learning (OCL) aims to discover objects in images or videos by reconstructing the input. Representative methods achieve this by reconstructing the input as its Variational Autoencoder (VAE) discrete representations, which suppress (super-)pixel noise and enhance object separability. However, these methods treat features as indivisible units, overlooking their compositional attributes, and discretize features via scalar code indexes, losing attribute-level similarities and differences. We propose Grouped Discrete Representation (GDR) for OCL. For better generalization, features are decomposed into combinatorial attributes by organized channel grouping. For better convergence, features are quantized into discrete representations via tuple code indexes. Experiments demonstrate that GDR consistently improves both mainstream and state-of-the-art OCL methods across various datasets. Visualizations further highlight GDR's superior object separability and interpretability. The source code is available at https://github.com/Genera1Z/GroupedDiscreteRepresentation.
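The difference between scalar and tuple code indexes can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two groups, the codebook sizes and values, the nearest-code rule with squared Euclidean distance, and the names `nearest` and `grouped_quantize` are all hypothetical.

```python
# Sketch of grouped quantization with tuple code indexes (pure Python).
# Codebooks, sizes, and function names are illustrative assumptions.

def nearest(code_list, vec):
    """Index of the code closest to vec in squared Euclidean distance."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, vec)) for code in code_list]
    return dists.index(min(dists))

def grouped_quantize(feature, codebooks):
    """Split feature into len(codebooks) equal channel groups and quantize each."""
    size = len(feature) // len(codebooks)
    index_tuple, quantized = [], []
    for k, book in enumerate(codebooks):
        chunk = feature[k * size:(k + 1) * size]   # channels of attribute group k
        i = nearest(book, chunk)
        index_tuple.append(i)
        quantized.extend(book[i])
    return tuple(index_tuple), quantized

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],                      # group 0: 2 codes
    [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]],          # group 1: 3 codes
]

idx_a, q_a = grouped_quantize([0.9, 1.2, 0.1, 0.8], codebooks)
idx_b, q_b = grouped_quantize([0.8, 0.9, 1.9, 2.2], codebooks)

# Number of attribute groups on which the two features agree.
shared = sum(i == j for i, j in zip(idx_a, idx_b))
```

Here five stored code vectors cover 2 × 3 = 6 combinations, and the two features share one of their two group indexes. That partial overlap is the attribute-level similarity a single scalar index would hide.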
