Spanning Tree Autoregressive Visual Generation
Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
TL;DR
STAR addresses the limitation of fixed raster-scan orders in autoregressive visual generation by providing bidirectional context through structured, randomized sequence orders. It draws a sequence order $\tau$ from the BFS traversal of a uniformly sampled spanning tree on the image's token lattice and models $p_\theta(x) = \mathbb{E}_{\tau \sim \mathcal{T}} \left[ \prod_{i=1}^N p_\theta(x_{\tau_i} \mid x_{\tau_{<i}}) \right]$, where $\mathcal{T}$ denotes the distribution over such traversal orders and $N$ the number of image tokens, balancing prior knowledge of images with flexibility in sequence order. On ImageNet-1k it demonstrates competitive class-conditional generation and superior postfix completion for image editing with minimal architectural changes, and it provides a scalable building block for future AR-based visual modeling. The method is relevant to real-world image editing and multimodal systems because it permits flexible inference-time sequence orders without sacrificing sampling quality.
Abstract
We present Spanning Tree Autoregressive (STAR) modeling, which incorporates prior knowledge of images, such as center bias and locality, to maintain sampling performance while providing sequence orders flexible enough to accommodate image editing at inference. Existing approaches that expose conventional autoregressive (AR) models to randomly permuted sequence orders for bidirectional context in visual generation either suffer a decline in performance or compromise flexibility in the choice of sequence order at inference. Instead, STAR uses traversal orders of uniform spanning trees sampled on a lattice defined by the positions of image patches. Traversal orders are obtained through breadth-first search, which allows us to efficiently construct, via rejection sampling, a spanning tree whose traversal order places a connected partial observation of the image as a prefix of the sequence. With this structured randomization, which is more tailored to images than random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance, without significant changes to the model architecture widely adopted in language AR modeling.
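
The order construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 4-connected patch lattice, uses Wilson's algorithm (loop-erased random walks) as one standard way to draw a uniform spanning tree, and starts the breadth-first traversal at the center patch as a stand-in for center bias. The function names, the shuffling of siblings, and the choice of root are all hypothetical. The rejection-sampling step for a prescribed prefix is not covered.

```python
import random
from collections import deque


def grid_neighbors(r, c, h, w):
    """Yield the 4-connected neighbors of cell (r, c) in an h x w lattice."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield nr, nc


def sample_uniform_spanning_tree(h, w, rng):
    """Draw a uniform spanning tree with Wilson's algorithm (assumed sampler).

    Returns an adjacency dict mapping each cell to its tree neighbors.
    """
    cells = [(r, c) for r in range(h) for c in range(w)]
    in_tree = {rng.choice(cells)}  # any start vertex yields a uniform tree
    adj = {cell: [] for cell in cells}
    for start in cells:
        if start in in_tree:
            continue
        # Random walk until the tree is hit; overwriting the last exit
        # from each cell erases loops implicitly.
        last_exit = {}
        cur = start
        while cur not in in_tree:
            last_exit[cur] = rng.choice(list(grid_neighbors(*cur, h, w)))
            cur = last_exit[cur]
        # Retrace the loop-erased path and attach it to the tree.
        cur = start
        while cur not in in_tree:
            in_tree.add(cur)
            adj[cur].append(last_exit[cur])
            adj[last_exit[cur]].append(cur)
            cur = last_exit[cur]
    return adj


def bfs_order(adj, root, rng):
    """Breadth-first traversal of the tree from root.

    Siblings are shuffled before being enqueued (an assumption, not
    something the source specifies).
    """
    order, seen, queue = [], {root}, deque([root])
    while queue:
        cell = queue.popleft()
        order.append(cell)
        children = [n for n in adj[cell] if n not in seen]
        rng.shuffle(children)
        for child in children:
            seen.add(child)
            queue.append(child)
    return order


def sample_token_order(h, w, seed=None):
    """Return flattened token indices (r * w + c) in one sampled order."""
    rng = random.Random(seed)
    adj = sample_uniform_spanning_tree(h, w, rng)
    root = (h // 2, w // 2)  # hypothetical root choice: the center patch
    return [r * w + c for r, c in bfs_order(adj, root, rng)]
```

Calling `sample_token_order(16, 16, seed=0)` returns a permutation of the 256 token indices of a 16 x 16 lattice. Because every non-root cell is visited after its tree parent, which is also a lattice neighbor, each prefix of the returned order is a connected region of the image.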
