Grouped Discrete Representation for Object-Centric Learning

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

TL;DR

The paper addresses a limitation of scalar-indexed discrete representations in Object-Centric Learning (OCL) by introducing Grouped Discrete Representation (GDR), which decomposes features into $g$ attribute groups and discretizes them with tuple indexes, so that attribute-level similarities can guide learning. It also introduces an invertible channel projection, built from a learnable matrix $W$ and its pseudo-inverse, that organizes channels for grouping, plus a residual pathway with annealing to preserve information. GDR is designed to be compatible with both Transformer-based and Diffusion-based OCL frameworks, and it improves convergence and generalization on image and video benchmarks while enhancing object separability and interpretability. Ablation studies indicate that a moderate number of groups and an adequate channel expansion rate, together with the invertible projection and the training tricks (residual, annealing, normalization), are key to maximizing gains. The findings suggest that GDR is a practical, extensible enhancement for VAE-based OCL that may generalize to other VAE-guided tasks and improves attribute-level representations without external supervision.
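To make the "learnable matrix and its pseudo-inverse" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the project-up matrix has shape c x d with c <= d and full row rank, in which case its pseudo-inverse is W^T (W W^T)^-1 and projecting up then down returns the input exactly. The matrix values and the helper names (`matmul`, `inverse`, `right_pinv`) are hypothetical. The quantization step that GDR places between the two projections is deliberately left out, so only the invertibility property is shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def right_pinv(w):
    """Pseudo-inverse of a full-row-rank matrix W (c x d, c <= d): W^T (W W^T)^-1."""
    wt = transpose(w)
    return matmul(wt, inverse(matmul(w, wt)))


# Hypothetical project-up matrix: c = 2 input channels expanded to d = 4 channels.
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0]]
W_pinv = right_pinv(W)        # d x c project-down matrix

x = [[0.5, -2.0]]             # one feature vector as a 1 x c row
z = matmul(x, W)              # project up: 1 x d (grouped quantization would act here)
x_back = matmul(z, W_pinv)    # project down: 1 x c, equals x when z is left unquantized
```

Because the project-down step is derived from W rather than learned separately, any information lost between `z` and `x_back` in a full model would come from the discretization in between, not from the projection pair itself.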

Abstract

Object-Centric Learning (OCL) aims to discover objects in images or videos by reconstructing the input. Representative methods achieve this by reconstructing the input as its Variational Autoencoder (VAE) discrete representations, which suppress (super-)pixel noise and enhance object separability. However, these methods treat features as indivisible units, overlooking their compositional attributes, and discretize features via scalar code indexes, losing attribute-level similarities and differences. We propose Grouped Discrete Representation (GDR) for OCL. For better generalization, features are decomposed into combinatorial attributes by organized channel grouping. For better convergence, features are quantized into discrete representations via tuple code indexes. Experiments demonstrate that GDR consistently improves both mainstream and state-of-the-art OCL methods across various datasets. Visualizations further highlight GDR's superior object separability and interpretability. The source code is available at https://github.com/Genera1Z/GroupedDiscreteRepresentation.
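The contrast between scalar and tuple code indexes can be illustrated with a small sketch. It assumes nearest-neighbour lookup under squared Euclidean distance and equal-sized channel groups. The codebook values, the function names and the "color"/"shape" labels are hypothetical and are not taken from the paper or its repository. Two features that agree on one attribute group get tuple indexes that share an entry, whereas their flattened scalar indexes are simply two different integers.

```python
def nearest(code_list, vec):
    """Return the index of the code closest to vec (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(code, vec)) for code in code_list]
    return dists.index(min(dists))


def grouped_quantize(feature, codebooks):
    """Split `feature` into len(codebooks) equal channel groups and quantize each group.

    Returns (tuple_index, quantized_feature).
    """
    g = len(codebooks)
    size = len(feature) // g
    assert size * g == len(feature), "channels must divide evenly into groups"
    tuple_index, quantized = [], []
    for k, codebook in enumerate(codebooks):
        chunk = feature[k * size:(k + 1) * size]
        idx = nearest(codebook, chunk)
        tuple_index.append(idx)
        quantized.extend(codebook[idx])
    return tuple(tuple_index), quantized


def to_scalar_index(tuple_index, sizes):
    """Flatten a tuple index into the equivalent scalar code index (mixed radix)."""
    scalar = 0
    for idx, n in zip(tuple_index, sizes):
        scalar = scalar * n + idx
    return scalar


# Toy grouped codebook: 2 groups, 3 codes each, 2 channels per code (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],     # group 0, say a "color"-like attribute
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],   # group 1, say a "shape"-like attribute
]
sizes = [len(cb) for cb in codebooks]

feat_a = [0.9, 0.1, 0.8, 1.1]
feat_b = [0.9, 0.1, -0.9, -1.2]   # same first group as feat_a, different second group

idx_a, quant_a = grouped_quantize(feat_a, codebooks)
idx_b, quant_b = grouped_quantize(feat_b, codebooks)

shared = sum(i == j for i, j in zip(idx_a, idx_b))   # attribute groups the two features share
print(idx_a, idx_b, shared)                          # (1, 1) (1, 2) 1
print(to_scalar_index(idx_a, sizes), to_scalar_index(idx_b, sizes))   # 4 5
```

In this toy setup six stored codes (3 per group) can express 3 x 3 = 9 distinct features, which is the combinatorial effect the abstract alludes to. The scalar indexes 4 and 5 carry no hint that the two features share their first attribute, while the tuples (1, 1) and (1, 2) do.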

Paper Structure

This paper contains 13 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Non-grouped vs grouped discrete representation. (upper) Existing methods treat features as units, selecting template features from a codebook by scalar indexes to discretize superpixels. (lower) We treat attributes as units, selecting template attributes from a grouped codebook by tuple indexes.
  • Figure 2: Our GDR is applicable to mainstream OCL. First row: architectures of Transformer-based (left) and Diffusion-based (right) methods. Second row: non-grouped representation discretization in dVAE (left), non-grouped discretization in VQ-VAE (right), and grouped discretization (center) of our method.
  • Figure 3: Object discovery visualization of SLATE and SlotDiffusion plus GDR.
  • Figure 4: GDR's invertible projection learns to organize channels' orders for grouped discretization. Every sub-plot has three columns of channels (black bars) and matrix weights among them (grey ribbons). The first column corresponds to continuous representation channels. Ribbons between the first and second columns are the project-up weights. The second column is discretization attribute groups. Ribbons between the second and third columns are the project-down weights. The third column is discretized representation channels.
  • Figure 5: GDR boosts object discovery performance of both Transformer- (top) and Diffusion-based (bottom) methods on images (left) and videos (right). A naive CNN is used as their primary encoder. Titles are datasets; x ticks are metrics while y ticks are metric values in adaptive scope. Higher values are better.
  • ...and 5 more figures