Equivariant Image Modeling

Ruixiao Dong, Mengde Xu, Zigang Geng, Li Li, Han Hu, Shuyang Gu

TL;DR

The paper addresses subtask conflicts in autoregressive image modeling by introducing an equivariant framework that aligns optimization targets across spatial subtasks through translation invariance. It combines column-wise 1D tokenization with windowed causal attention to enforce consistent contextual relationships and enable efficient, long-horizon generation. Empirical results on ImageNet-1k at 256×256 show competitive performance with fewer GFLOPs and improved zero-shot generalization and ultra-long image synthesis, supported by analyses of equivariance and ablations. The work provides a principled, task-aligned decomposition approach and releases code and models to facilitate broader adoption.

Abstract

Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into a series of simpler subtasks. However, inherent conflicts arise during the joint optimization of these subtasks, and existing solutions fail to resolve such conflicts without sacrificing efficiency or scalability. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks by leveraging the translation invariance of natural visual signals. Our method introduces (1) column-wise tokenization which enhances translational symmetry along the horizontal axis, and (2) windowed causal attention which enforces consistent contextual relationships across positions. Evaluated on class-conditioned ImageNet generation at 256x256 resolution, our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources. Systematic analysis demonstrates that enhanced equivariance reduces inter-task conflicts, significantly improving zero-shot generalization and enabling ultra-long image synthesis. This work establishes the first framework for task-aligned decomposition in generative modeling, offering insights into efficient parameter sharing and conflict-free optimization. The code and models are publicly available at https://github.com/drx-code/EquivariantModeling.
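
The idea behind windowed causal attention can be illustrated with a small mask builder. This is a minimal sketch, not the authors' implementation: the function names `windowed_causal_mask` and `relative_context`, the window size of 3, and the choice of treating each sequence position as one attention unit are all assumptions made for illustration. It shows that once a position is at least `window - 1` steps into the sequence, every position sees the same set of relative offsets, which is the sense in which the context is consistent across positions.

```python
def windowed_causal_mask(num_tokens, window):
    """Return a boolean matrix where mask[i][j] is True if query i may attend to key j.

    A key is visible only if it is not in the future (j <= i) and lies
    within the last `window` positions (i - j < window).
    Both parameters are hypothetical and chosen for illustration.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= i - j < window for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def relative_context(mask, i):
    """List the offsets (i - j) of the keys that query i may attend to."""
    return [i - j for j, allowed in enumerate(mask[i]) if allowed]


mask = windowed_causal_mask(num_tokens=6, window=3)

# Positions far enough from the start share one relative context,
# so the same prediction rule applies at each of them.
for i in range(6):
    print(i, relative_context(mask, i))
```

With plain causal attention, the context of position `i` grows with `i`, so each position faces a differently shaped subtask. Bounding the window keeps the context shape fixed after the first few positions, which is one way to read the abstract's claim about consistent contextual relationships.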

Paper Structure

This paper contains 25 sections, 8 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Illustration of the Equivariant Image Generation Framework. The tokenizer translates the image into 1D tokens arranged in columns, and an enhanced autoregressive model learns the column-wise token distribution.
  • Figure 2: Visual Meanings of 1D Tokens. By progressively replacing the randomly initialized token sequence with tokens encoded from the ground truth images, the decoder faithfully reconstructs the original images step by step.
  • Figure 3: Training Loss of Different Models. Left: the training loss of different methods at the early (10 epochs) and late (100 epochs) training stages. Right: the relative loss improvement of different methods under different settings, compared to the early stage of the Multi-task setting. A higher value indicates better performance. The equivariant generation approach can transfer the improvement from a single task to other untrained tasks.
  • Figure 4: Converged Training Loss on ImageNet vs LHQ. Compared to ImageNet, the visual statistics in LHQ demonstrate greater uniformity, as does the task-wise loss distribution.
  • Figure 5: Visual examples of long image generation. We present visual examples of long images with arbitrary lengths, generated by our model trained on the Places dataset with a fixed length of 256.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 2.1