MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

Lixuan He, Shikang Zheng, Linfeng Zhang

TL;DR

Autoregressive image generation over a flat token vocabulary ($N$ tokens) ignores the semantic structure of token embeddings, causing high prediction entropy and inefficiency. MASC constructs a manifold-aligned semantic hierarchy from the codebook using a centroid-free distance and density-driven agglomerative clustering to produce a structure-aware prior, so the AR model predicts coarse branch indices, modeling $p(z^b_t \mid z^b_{<t})$, instead of individual tokens. It accelerates training by up to 57% and reduces FID from $2.87$ to $2.58$ on LlamaGen-XL, while generalizing across tokenizers and AR frameworks, demonstrating the practical value of structuring the prediction space. This work suggests that architectural advances must be complemented by principled priors that respect the geometric and statistical structure of discrete representations for scalable, high-fidelity autoregressive generation.

Abstract

Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
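
To make the effect of the coarser prediction space concrete, here is a short sketch; it is not the paper's own derivation. It assumes a mapping $\mathcal{M}$ from the $N$ fine tokens to $K < N$ branch indices, with $z^b_t = \mathcal{M}(z_t)$. The symbols $K$ (number of branches) and $T$ (sequence length) are introduced here for illustration only.

$$
p(z^b_{1:T}) = \prod_{t=1}^{T} p(z^b_t \mid z^b_{<t}), \qquad H(z^b_t \mid z^b_{<t}) \le \log K < \log N .
$$

Here $\log N$ is the corresponding worst-case entropy of a flat softmax over all $N$ tokens, so under these assumptions the per-step uncertainty ceiling drops by $\log(N/K)$ when the model predicts branches instead of tokens.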

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: MASC demonstrates strong capabilities in high-fidelity and diverse image generation.
  • Figure 2: Overview of the MASC-integrated autoregressive generation pipeline; the bottom row showcases high-fidelity images generated with MASC. Stage 1: A standard image tokenizer is trained, yielding a finalized codebook of visual tokens. MASC Preprocessing: Our proposed MASC framework takes this codebook and constructs a hierarchical semantic tree, producing a mapping function ($\mathcal{M}$) from fine-grained tokens to coarse-grained semantic branches. Stage 2: The autoregressive Transformer is then trained on these simplified branch indices.
  • Figure 3: Conceptual illustration of MASC versus k-means. Left: Visual tokens reside on a semantic manifold where geodesic distance is a better similarity measure than Euclidean distance (dashed line). Right: Geometry-agnostic k-means produces incoherent clusters, whereas MASC's manifold-aligned approach correctly captures the intrinsic data structure and preserves semantic integrity.
  • Figure 4: Training dynamics of LlamaGen-L on ImageNet. The plots compare the FID (left, lower is better) and IS (right, higher is better) scores over training epochs for the Vanilla baseline, the + k-means variant, and our + MASC method. The MASC-enhanced model demonstrates a faster convergence rate, reaching the baseline's final performance in approximately half the training time, and ultimately converging to a much better result. This visualizes the training acceleration benefit of operating in a low-entropy prediction space.
  • Figure 5: A conceptual illustration of linkage criteria. (a) Single-linkage may incorrectly merge two distinct semantic groups if they are connected by a bridge of a few close points. (b) Average-linkage, as used in MASC, considers the overall distribution of points and is more robust, correctly identifying distinct clusters.
  • ...and 1 more figures
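
The preprocessing step in Figure 2 and the average-linkage criterion in Figure 5 can be illustrated with a minimal sketch. This is not the authors' implementation: MASC's centroid-free distance and density-driven construction are not specified in this section, so plain Euclidean average linkage is swapped in as a stand-in. The function names, the toy codebook, and the `num_branches` parameter are all hypothetical.

```python
import math
from itertools import combinations


def average_linkage_distance(a, b, emb):
    """Mean pairwise Euclidean distance between two clusters of token ids."""
    total = sum(math.dist(emb[i], emb[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def build_branch_mapping(emb, num_branches):
    """Greedy agglomerative clustering (naive, for small codebooks only).

    Returns a list M with M[token_id] = branch_id, standing in for the
    mapping function from fine tokens to coarse branches.
    Assumption: Euclidean average linkage replaces MASC's own metric.
    """
    clusters = [[i] for i in range(len(emb))]
    while len(clusters) > num_branches:
        # Pick the pair of clusters with the smallest average-linkage distance.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: average_linkage_distance(
                clusters[pq[0]], clusters[pq[1]], emb
            ),
        )
        # p < q always holds, so deleting index q leaves index p valid.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    # Number branches by their smallest member so ids are deterministic.
    clusters.sort(key=min)
    mapping = [0] * len(emb)
    for branch_id, members in enumerate(clusters):
        for token in members:
            mapping[token] = branch_id
    return mapping


# Toy 2-D "codebook": two well-separated groups of three embeddings each.
toy_codebook = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
M = build_branch_mapping(toy_codebook, num_branches=2)

# Remap a fine-grained token sequence to branch indices (Stage 2 input).
fine_tokens = [0, 3, 1, 5, 2, 4]
branch_tokens = [M[t] for t in fine_tokens]
print(M)              # [0, 0, 0, 1, 1, 1]
print(branch_tokens)  # [0, 1, 0, 1, 0, 1]
```

In this sketch the autoregressive model would be trained on `branch_tokens` rather than `fine_tokens`. The naive pair search is cubic in the codebook size, so a real codebook would need a more efficient clustering routine.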