MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
Lixuan He, Shikang Zheng, Linfeng Zhang
TL;DR
Autoregressive image generation over a flat vocabulary of $N$ visual tokens ignores the semantic structure of the token embeddings, which leads to high prediction entropy and reduced training efficiency. MASC builds a manifold-aligned semantic hierarchy from the codebook, using a centroid-free distance and density-driven agglomerative clustering, to produce a structure-aware prior. The AR model then predicts coarse indices $z^b_t$ via $p(z^b_t \mid z^b_{<t})$ instead of individual tokens. MASC accelerates training by up to 57% and reduces FID from $2.87$ to $2.58$ on LlamaGen-XL, and it generalizes across tokenizers and AR frameworks, demonstrating the practical value of structuring the prediction space. This work suggests that architectural advances must be complemented by principled priors that respect the geometric and statistical structure of discrete representations for scalable, high-fidelity autoregressive generation.
Abstract
Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
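To make the idea of a structured prediction space concrete, the following toy sketch groups a tiny codebook into clusters and relabels a token sequence with coarse cluster indices. It is only an illustration under stated assumptions: plain average-linkage agglomerative clustering with Euclidean distance stands in for MASC's geometry-aware metric and density-driven construction, which are not specified here, and the names `average_linkage`, `cluster_codebook`, and the six-entry 2-D `codebook` are hypothetical.

```python
import math
from itertools import combinations


def average_linkage(a, b, codebook):
    """Mean pairwise Euclidean distance between two clusters of token ids.

    No cluster centroid is computed. This is a generic stand-in, NOT the
    paper's geometry-aware metric.
    """
    total = sum(math.dist(codebook[i], codebook[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def cluster_codebook(codebook, num_clusters):
    """Greedily merge the closest pair of clusters until num_clusters remain.

    Returns a dict mapping each fine token id to a coarse cluster id.
    """
    clusters = [[i] for i in range(len(codebook))]
    while len(clusters) > num_clusters:
        # combinations yields index pairs (x, y) with x < y
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], codebook),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: y > x, so index x is unaffected
    return {t: c for c, members in enumerate(clusters) for t in members}


# Hypothetical 2-D codebook with two well-separated groups of embeddings.
codebook = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1), (5.0, 5.2)]
token_to_cluster = cluster_codebook(codebook, num_clusters=2)

# A fine-grained token sequence and its coarse relabeling, i.e. the kind of
# sequence an AR model would see when predicting cluster indices.
fine_sequence = [0, 2, 1, 5, 4, 3]
coarse_sequence = [token_to_cluster[t] for t in fine_sequence]
print(coarse_sequence)  # [0, 1, 0, 1, 0, 1]
```

In this toy setting the prediction target shrinks from six token ids to two cluster ids, which is the sense in which a clustered vocabulary can lower prediction entropy. The choice of cluster count and the actual distance would follow the method's own construction.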
