REOrdering Patches Improves Vision Models

Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta

TL;DR

This work shows that patch ordering materially affects the performance of long-sequence vision models, challenging the assumption that the permutation equivariance of self-attention makes ordering irrelevant. It introduces REOrder, a two-stage framework that first derives an information-theoretic prior over patch sequences and then learns a task-specific patch permutation via a Plackett-Luce policy optimized with REINFORCE. Evaluated with ViT, Transformer-XL, Longformer, and Mamba-ARM on ImageNet-1K and Functional Map of the World, REOrder improves top-1 accuracy over row-major ordering by up to 3.01% and 13.35%, respectively. The approach offers a practical, plug-in way to mitigate ordering biases in long-sequence vision models and suggests future directions such as dynamic, per-image sequencing and scaling to ultra-long sequences.

Abstract

Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this equivariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and on Functional Map of the World by up to 13.35%.
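The second stage described in the abstract (a Plackett-Luce policy trained with REINFORCE) can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the function names, the learning rate, the running-mean baseline, and the toy reward (1 when patch 0 is placed first) are all assumptions made for illustration, whereas in the paper the learning signal would come from the downstream vision model.

```python
import math
import random

def sample_plackett_luce(logits, rng):
    """Draw a permutation: repeatedly pick one remaining item with
    probability proportional to exp(logit), without replacement."""
    remaining = list(range(len(logits)))
    order = []
    while remaining:
        m = max(logits[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(logits[i] - m) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        order.append(pick)
        remaining.remove(pick)
    return order

def log_prob(logits, order):
    """log P(order) = sum_k [ z_{order[k]} - logsumexp(z over order[k:]) ]."""
    total = 0.0
    for k in range(len(order)):
        rest = [logits[i] for i in order[k:]]
        m = max(rest)
        total += logits[order[k]] - (m + math.log(sum(math.exp(z - m) for z in rest)))
    return total

def grad_log_prob(logits, order):
    """Gradient of log_prob with respect to the logits."""
    grad = [0.0] * len(logits)
    for k in range(len(order)):
        rest = order[k:]
        m = max(logits[i] for i in rest)
        denom = sum(math.exp(logits[i] - m) for i in rest)
        grad[order[k]] += 1.0
        for i in rest:
            grad[i] -= math.exp(logits[i] - m) / denom
    return grad

def reinforce_step(logits, reward_fn, rng, lr=0.5, baseline=0.0):
    """One REINFORCE update: ascend (reward - baseline) * grad log P(order)."""
    order = sample_plackett_luce(logits, rng)
    reward = reward_fn(order)
    grad = grad_log_prob(logits, order)
    new_logits = [z + lr * (reward - baseline) * g for z, g in zip(logits, grad)]
    return new_logits, order, reward

# Toy usage (hypothetical reward): encourage patch 0 to come first.
rng = random.Random(0)
logits = [0.0] * 6
baseline = 0.0
for _ in range(200):
    logits, order, reward = reinforce_step(
        logits, lambda o: 1.0 if o[0] == 0 else 0.0, rng, baseline=baseline)
    baseline = 0.9 * baseline + 0.1 * reward  # running-mean baseline
```

Parameterizing the policy with one logit per patch keeps the number of learnable parameters linear in the sequence length, even though the space of permutations is factorial in size.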

Paper Structure

This paper contains 51 sections, 2 theorems, 24 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Proposition 3.1

For every permutation matrix $\mathbf{P}\in\{0,1\}^{n\times n}$,

Figures (8)

  • Figure 1: Visualizations of alternate patch sequence orderings. Six different patch orders (row-major, column-major, Hilbert curve, spiral, diagonal, and snake) are shown as trajectories over a $14\times14$ grid of patches. Each trajectory begins at the red dot and progresses to the black dot, illustrating the 1-D ordering imposed on the 2-D patch grid.
  • Figure 2: Patch order affects the performance of long-sequence models. This figure compares the top-1 accuracy of Vision Transformer (ViT), Longformer, Mamba, and Transformer-XL (T-XL) on ImageNet-1K and Functional Map of the World when using alternate patch orderings, relative to their standard row-major performance. As expected, ViT remains equivariant to patch sequence permutations, so its accuracy does not depend on the ordering. In contrast, long-sequence models exhibit substantial performance variability depending on the patch ordering. No single ordering consistently outperforms the others across models or datasets, which motivates dynamic patch ordering strategies.
  • Figure 3: Compression of 1-D sequences can serve as a weak prior for optimal patch ordering. Top-1 accuracy is plotted against the percentage reduction in size under compression for different patch orderings, across four models, on both ImageNet-1K and Functional Map of the World (FMoW).
  • Figure 4: The logits of the Plackett-Luce model, and therefore the permutation order, change over the course of training. Longformer is initialized with column-major and row-major patch orderings and optimized with REOrder on ImageNet-1K. The image is of the class "keyboard." We track two patches over the course of the policy curriculum: a keyboard key (light red arrow) and an irrelevant orange beak (dark red arrow). As the policy learns to order patches, the patches move toward the back of the sequence (i.e., are back-loaded), reflecting the dataset's center bias.
  • Figure 5: REOrder finds improvements over the best patch ordering prior in almost all cases. For nearly every model, REOrder finds a better patch ordering than the best static prior and improves accuracy on both ImageNet-1K and Functional Map of the World.
  • ...and 3 more figures
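
The static orderings in Figure 1 and the compression prior in Figure 3 can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: only three of the six orderings are shown, "snake" is assumed to mean a boustrophedon row traversal, `zlib` stands in for whichever compressor the paper uses, and the striped toy "image" (one 16-byte block per patch) is hypothetical.

```python
import zlib

def row_major(h, w):
    """Raster scan: left to right within a row, rows top to bottom."""
    return [r * w + c for r in range(h) for c in range(w)]

def column_major(h, w):
    """Top to bottom within a column, columns left to right."""
    return [r * w + c for c in range(w) for r in range(h)]

def snake(h, w):
    """Assumed definition: row traversal that reverses direction every row."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend(r * w + c for c in cols)
    return order

def percent_reduction(patch_bytes, order):
    """Percentage size reduction of the flattened sequence under zlib.
    patch_bytes[i] holds the raw bytes of patch i (row-major index)."""
    seq = b"".join(patch_bytes[i] for i in order)
    compressed = zlib.compress(seq, 9)
    return 100.0 * (1.0 - len(compressed) / len(seq))

# Toy usage: a 14x14 grid of patches with vertical stripes (hypothetical data).
H, W = 14, 14
patches = [bytes([(17 * c) % 256]) * 16 for r in range(H) for c in range(W)]
scores = {
    name: percent_reduction(patches, fn(H, W))
    for name, fn in [("row-major", row_major),
                     ("column-major", column_major),
                     ("snake", snake)]
}
```

Each ordering is just a permutation of the row-major patch indices, so any of them can be applied by indexing the patch sequence before it is fed to the model.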

Theorems & Definitions (3)

  • Proposition 3.1: Permutation equivariance of self-attention
  • Theorem 1: Permutation equivariance of self-attention
  • proof
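
The permutation equivariance named in Proposition 3.1 can be checked numerically. The sketch below assumes a single attention head with no positional encoding, written as softmax(QKᵀ/√d)V, which may differ from the paper's exact notation; the weight matrices and input are random placeholders. Under that assumption, permuting the input rows should permute the output rows in the same way.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head full self-attention without positional information."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def permute_rows(x, perm):
    """Apply a permutation to the rows (tokens) of x."""
    return [x[i] for i in perm]

def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

# Toy usage with random placeholders: n tokens of dimension d.
rng = random.Random(0)
n, d = 6, 4
rand = lambda rows, cols: [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
x, wq, wk, wv = rand(n, d), rand(d, d), rand(d, d), rand(d, d)
perm = list(range(n))
rng.shuffle(perm)

lhs = self_attention(permute_rows(x, perm), wq, wk, wv)   # attention of permuted input
rhs = permute_rows(self_attention(x, wq, wk, wv), perm)   # permuted attention output
gap = max_abs_diff(lhs, rhs)                              # differs only by rounding error
```

The check relies on every token attending to every other token. Architectures that restrict or approximate attention, such as the long-sequence models listed in Figure 2, are the ones the source reports as sensitive to patch order.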