Spanning Tree Autoregressive Visual Generation

Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

TL;DR

STAR addresses the limitation of fixed raster-scan orders in autoregressive visual generation by enabling bidirectional context through a structured randomized sequence. It defines a sequence $\tau$ drawn from the BFS traversal of a uniformly sampled spanning tree on the image's token lattice, and trains with $p_\theta(x) = \mathbb{E}_{\tau \sim \mathcal{T}} [ \prod_{i=1}^N p_\theta(x_{\tau_i}|x_{\tau_{<i}}) ]$, effectively balancing prior knowledge with flexibility. It demonstrates competitive class-conditional generation and superior postfix completion for image editing on ImageNet-1k with minimal architectural changes, and provides a scalable building block for future AR-based visual modeling. This method has practical impact for real-world image editing and multimodal systems by enabling flexible inference-time sequence orders without sacrificing sampling quality.

Abstract

We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while providing sequence orders flexible enough to accommodate image editing at inference. Approaches that expose conventional autoregressive (AR) models to randomly permuted sequence orders in visual generation, in order to obtain bidirectional context, either suffer a decline in performance or compromise the flexibility of sequence order choice at inference. Instead, STAR uses traversal orders of uniform spanning trees sampled on a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, and rejection sampling lets us efficiently construct a spanning tree whose traversal order ensures that a connected partial observation of the image appears as a prefix of the sequence. With this randomized strategy, which is more structured and tailored to images than random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance, without any significant changes to the model architecture widely adopted in language AR modeling.
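As a rough illustration of how such a sequence order could be produced, the sketch below samples a uniform spanning tree on a small patch lattice and traverses it with breadth-first search from a corner. It is not the authors' implementation: the use of Wilson's algorithm (loop-erased random walks) as the uniform sampler, the random ordering of siblings during BFS, and all function names are assumptions made for this example, and the rejection-sampling step for partial observations is not covered.

```python
import random
from collections import deque


def neighbors(node, height, width):
    """Yield the 4-connected lattice neighbors of a (row, col) position."""
    r, c = node
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < height and 0 <= nc < width:
            yield (nr, nc)


def sample_uniform_spanning_tree(height, width, rng):
    """Sample a spanning tree of the lattice with Wilson's algorithm.

    Assumption: Wilson's algorithm is used here only because it is a
    standard uniform sampler; the source does not name a sampler.
    Returns an undirected adjacency dict {node: [tree neighbors]}.
    """
    root = (0, 0)
    in_tree = {root}
    adjacency = {(r, c): [] for r in range(height) for c in range(width)}
    for start in adjacency:
        if start in in_tree:
            continue
        # Random walk until the current tree is hit. Overwriting the
        # exit direction of a revisited node erases loops implicitly.
        next_step = {}
        node = start
        while node not in in_tree:
            next_step[node] = rng.choice(list(neighbors(node, height, width)))
            node = next_step[node]
        # Retrace the loop-erased path and attach it to the tree.
        node = start
        while node not in in_tree:
            in_tree.add(node)
            nxt = next_step[node]
            adjacency[node].append(nxt)
            adjacency[nxt].append(node)
            node = nxt
    return adjacency


def bfs_order(adjacency, root, rng):
    """Return the BFS traversal order of the tree starting at root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        children = [n for n in adjacency[node] if n not in seen]
        rng.shuffle(children)  # assumption: sibling order is randomized
        for child in children:
            seen.add(child)
            queue.append(child)
    return order


if __name__ == "__main__":
    rng = random.Random(0)
    tree = sample_uniform_spanning_tree(4, 4, rng)
    print(bfs_order(tree, (0, 0), rng))
```

Because every node is visited after its parent in the tree, each prefix of the resulting order is a connected region of the lattice, which is the property that favors local context over an arbitrary random permutation.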

Paper Structure

This paper contains 23 sections, 4 equations, 16 figures, 4 tables, 2 algorithms.

Figures (16)

  • Figure 1: Comparison of sequence orders in AR visual generation. (a) Conventional AR models follow a fixed raster-scan order, which limits the ability to change the sequence order at inference. (b) AR models trained under random permutation are flexible in the inference sequence order, but random permutation does not reflect prior knowledge, such as center bias and locality. (c) The proposed method, STAR, adopts the traversal order of a uniform spanning tree so that it structurally maintains both prior knowledge and flexibility.
  • Figure 2: Overview of the proposed STAR modeling. (a) We perform training and inference according to the sequence order obtained by traversing the uniform spanning tree from the corner using BFS, additionally feeding the positional embedding of the next token position to the model. (b) We perform rejection sampling to sample a spanning tree with a maximum depth occurring at the boundary of the non-masked region, allowing postfix completion when partial observations of the image are provided for image editing.
  • Figure 3: Prediction entropy of tokens by token position, from the model trained with randomly permuted sequence orders. Higher token entropy in central regions indicates greater token diversity than in corner areas. These differences suggest a potential hierarchical structure across positions that affects the difficulty of prediction.
  • Figure 4: Conditional prediction entropy by minimum Manhattan distance from the prefix token positions. Tokens at closer distances tend to exhibit lower entropy, whereas those farther away show higher entropy. This suggests that the locality structure of images is partially preserved at the token level, implying that adjacent tokens should be prioritized in prediction.
  • Figure 5: Examples of sampled results from STAR-XXL in the class-conditional image generation on ImageNet-1k at $256 \times 256$.
  • ...and 11 more figures
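
The distance measure named in the Figure 4 caption can be made concrete with a small sketch. It computes, for every position on the token lattice, the minimum Manhattan distance to a set of prefix token positions. The function name and the (row, col) position convention are assumptions for illustration; the source does not describe how the measurement was implemented.

```python
def min_manhattan_distance_map(height, width, prefix_positions):
    """Minimum Manhattan distance from each lattice cell to the prefix.

    prefix_positions is a non-empty iterable of (row, col) positions
    that have already been generated (hypothetical convention).
    """
    prefix = list(prefix_positions)
    return [
        [min(abs(r - pr) + abs(c - pc) for pr, pc in prefix) for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    # Distances on a 3x3 lattice when only the top-left token is known.
    for row in min_manhattan_distance_map(3, 3, [(0, 0)]):
        print(row)
```

Grouping per-token entropies by these distance values would give the kind of distance-versus-entropy comparison the caption describes.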