The Collapse of Patches

Wei Guo, Shunqi Mao, Zhuonan Liang, Heng Wang, Weidong Cai

TL;DR

Many vision models assume uniform patch dependencies in masked image modeling. The authors introduce patch collapse and train a Collapse Masked Autoencoder (CoMAE) whose learned patch dependency graph, scored with PageRank, yields a collapse order over patches. They show that supervising autoregressive generation (CMAR) and Vision Transformers (CViT) with this order improves generation quality and reduces computation, with accurate classification from as little as 22 percent of the image patches. The work offers an efficiency-oriented perspective on image modeling and a foundation for efficient, scalable vision systems.

Abstract

Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch's PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at https://github.com/wguo-ai/CoP.

Paper Structure

This paper contains 28 sections, 1 theorem, 15 equations, 13 figures, 2 tables.

Key Result

Theorem 1

Let $\mathbf{A} \in [0,1]^{N \times N}$ be the learned dependency matrix with $\mathbf{A}_{ij}$ indicating the influence of patch $j$ on patch $i$. Let $P$ be the corresponding column-stochastic matrix, and let $c \in (0,1)$. Ordering patches in descending order of: minimizes the linearized proxy of $H_c$ at each prefix. If $\beta$ is constant or interpreted as a personalized teleport vector, thi
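The ranking step that Theorem 1 refers to can be illustrated with a minimal sketch of personalized PageRank over a dependency matrix. This is not the authors' implementation. It assumes that row $i$ of $\mathbf{A}$ is normalized and transposed to obtain the column-stochastic $P$, that $c$ acts as the usual damping factor, and that $\beta$ is a teleport vector (uniform by default). The function name `collapse_order` and the default `c=0.85` are hypothetical.

```python
import numpy as np

def collapse_order(A, c=0.85, beta=None, tol=1e-10, max_iter=1000):
    """Rank patches by PageRank on a dependency matrix (illustrative sketch).

    A[i, j] is the influence of patch j on patch i, as in the theorem.
    Assumption: patch i passes its score to the patches it relies on,
    so each row of A is normalized and then transposed to build P.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if beta is None:
        beta = np.full(n, 1.0 / n)            # uniform teleport vector
    else:
        beta = np.asarray(beta, dtype=float)
        beta = beta / beta.sum()              # normalize a personalized vector

    row_sums = A.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # Rows with no dependencies spread their mass uniformly.
    W = np.where(row_sums > 0, A / safe, 1.0 / n)
    P = W.T                                   # every column of P sums to 1

    r = beta.copy()
    for _ in range(max_iter):
        r_next = c * (P @ r) + (1.0 - c) * beta   # power iteration step
        converged = np.abs(r_next - r).sum() < tol
        r = r_next
        if converged:
            break

    order = np.argsort(-r, kind="stable")     # descending score
    return order, r

# Toy example: patches 1 and 2 both rely on patch 0.
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
order, scores = collapse_order(A)
```

In the toy matrix, patch 0 is the one the others rely on, so it comes first in the returned order. A different normalization of $\mathbf{A}$ would change the scores, so the construction of $P$ here should be read as one plausible choice.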

Figures (13)

  • Figure 1: Patch synthesis in random and collapse orders. We autoregressively generate rooster image patches following random order (above) and collapse order (below). The latter synthesizes prominent rooster features and reduces image uncertainty more effectively.
  • Figure 2: Pipeline overview. Given an image, the CoMAE encoder selects the most influential patches needed to reconstruct each patch, while trivial patches are masked with heavier noise injection. These selection weights form a patch dependency graph on which we compute the PageRank scores to determine the collapse order of patches, where higher-rank patches are less dependent on the rest of the image. Finally, we use this ranking to supervise image generation and classification tasks to follow the correct patch processing order.
  • Figure 3: Comparison of generators and classifiers. Our generator (CMAR) and classifier (CViT) respect the collapse order.
  • Figure 4: Visualization of collapse order. The left figure shows image patches with different collapse ranks indicated by circle sizes. The right figure connects the top-ranked 64 patches by collapse order. One can observe that top patches outline important shapes in each image.
  • Figure 5: Class-wise collapse order patterns. These heatmaps show sample patch indices sorted in collapse order for each class.
  • ...and 8 more figures
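The captions above describe a classifier that only sees high-rank patches. A minimal sketch of that selection step follows, assuming per-patch scores are already available (for example, from a PageRank computation). The function name `keep_top_patches`, the array shapes, and the rounding rule (ceiling) are hypothetical. Only the 22% ratio comes from the abstract.

```python
import numpy as np

def keep_top_patches(tokens, scores, keep_ratio=0.22):
    """Keep the highest-scoring patch tokens (illustrative sketch).

    tokens: array of shape (N, D), one row per patch.
    scores: array of shape (N,), higher means earlier in collapse order.
    Returns the kept tokens and their original patch indices, both in
    collapse order, so position embeddings can be looked up by index.
    """
    tokens = np.asarray(tokens)
    scores = np.asarray(scores, dtype=float)
    n = tokens.shape[0]
    k = max(1, int(np.ceil(keep_ratio * n)))      # assumed rounding rule
    kept = np.argsort(-scores, kind="stable")[:k]
    return tokens[kept], kept

# Toy example with 10 patches of dimension 4.
tokens = np.arange(40, dtype=float).reshape(10, 4)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5])
visible, kept = keep_top_patches(tokens, scores)
```

Returning the original indices alongside the tokens keeps the spatial position of each visible patch recoverable, which a Vision Transformer would need for its positional embeddings.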

Theorems & Definitions (3)

  • Definition 1: Cumulative conditional entropy
  • Theorem 1: Optimal collapse ranking
  • proof