
Semantic-Aware Prefix Learning for Token-Efficient Image Generation

Qingfeng Li, Haoxian Zhang, Xu He, Songlin Tang, Zhixue Fang, Xiaoqiang Liu, Pengfei Wan, Guoqi Li

Abstract

Visual tokenizers play a central role in latent image generation by bridging high-dimensional images and tractable generative modeling. However, most existing tokenizers are still trained with reconstruction-dominated objectives, which often yield latent representations that are only weakly grounded in high-level semantics. Recent approaches improve semantic alignment, but typically treat semantic signals as auxiliary regularization rather than making them functionally necessary for representation learning. We propose SMAP, a SeMantic-Aware Prefix tokenizer that injects class-level semantic conditions into a query-based 1D tokenization framework. To make semantics indispensable during training, SMAP introduces a tail token dropping strategy, which forces semantic conditions and early latent prefixes to bear increasing responsibility under progressively reduced token budgets. To verify that the resulting latent space is useful for generation rather than reconstruction alone, we further introduce CARD, a hybrid Causal AutoRegressive--Diffusion generator. Extensive experiments on ImageNet show that SMAP consistently improves reconstruction quality across discrete and continuous tokenization settings, and that its semantically grounded latent space yields strong downstream generation performance under compact token budgets.
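The abstract's tail token dropping strategy can be illustrated with a minimal sketch: during training, the latent token sequence is truncated to a randomly sampled budget, so the semantic condition and the earliest latent tokens must carry progressively more of the reconstruction burden. The function names and the uniform budget schedule below are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def tail_token_drop(latent_tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only the first `keep` latent tokens, dropping the tail.

    latent_tokens: (batch, num_tokens, dim) latent prefix from the tokenizer.
    """
    return latent_tokens[:, :keep, :]

def sample_budget(num_tokens: int, generator=None) -> int:
    # Hypothetical schedule: sample a token budget uniformly from
    # {1, ..., num_tokens}. Prefix tokens survive under every budget,
    # so they are pushed to encode the most important information.
    return int(torch.randint(1, num_tokens + 1, (1,), generator=generator).item())

z = torch.randn(2, 32, 16)        # batch of 2, 32 latent tokens, dim 16
budget = sample_budget(z.shape[1])
z_prefix = tail_token_drop(z, budget)
assert z_prefix.shape == (2, budget, 16)
```

Because the decoder only ever sees the surviving prefix (plus the class condition), reconstruction under small sampled budgets cannot succeed unless the semantic condition is genuinely informative, which is what makes semantics functionally necessary rather than auxiliary.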

Paper Structure

This paper contains 20 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Semantic-aware prefix learning in reconstruction and generation. Top: Using only the class condition, SMAP reconstructs images that already capture category-level semantics and coarse global structure. Middle: Adding latent tokens substantially improves reconstruction fidelity and restores instance-specific details, showing that semantic conditions and latent prefixes play complementary roles. Bottom: Based on the resulting semantically grounded token space, CARD generates high-quality class-conditional images.
  • Figure 2: Overview of our method. (a) proposes a novel mechanism for semantic injection. It extracts conditional embeddings from class labels and inserts them between visual patch tokens and learnable latent tokens. The condition embeddings act as intermediaries that interact jointly with image patches to guide the formation of latent tokens. It further strengthens semantic dependency through a tail token dropping strategy. (b) proposes a hybrid Causal AutoRegressive–Diffusion framework that fully leverages SMAP’s capabilities. (c) shows the SMAP tokenization process for CARD generation.
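The semantic-injection mechanism described in Figure 2(a) can be sketched as a query-based encoder whose input sequence places a class-condition embedding between the visual patch tokens and the learnable latent queries, so the condition mediates between image content and the latent prefix. All module names, dimensions, and the two-layer encoder below are illustrative assumptions for a self-contained sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SemanticPrefixTokenizer(nn.Module):
    """Minimal sketch of query-based 1D tokenization with a class-condition
    embedding inserted between patch tokens and latent queries.
    Names and sizes are illustrative, not taken from the paper's code."""

    def __init__(self, dim=64, num_latents=8, num_classes=1000):
        super().__init__()
        self.cond_embed = nn.Embedding(num_classes, dim)                   # condition C
        self.latent_queries = nn.Parameter(torch.randn(num_latents, dim))  # queries for Z
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_latents = num_latents

    def forward(self, patch_tokens, class_ids):
        b = patch_tokens.shape[0]
        cond = self.cond_embed(class_ids).unsqueeze(1)                # (b, 1, dim)
        queries = self.latent_queries.unsqueeze(0).expand(b, -1, -1)  # (b, L, dim)
        # Order: [visual patches; condition; latent queries], so the
        # condition sits between image content and the latent tokens.
        seq = torch.cat([patch_tokens, cond, queries], dim=1)
        out = self.encoder(seq)
        return out[:, -self.num_latents:, :]                          # latent tokens Z

tok = SemanticPrefixTokenizer()
z = tok(torch.randn(2, 16, 64), torch.tensor([3, 7]))  # 2 images, 16 patches each
assert z.shape == (2, 8, 64)
```

In this arrangement every latent query attends jointly to the patches and the condition, which is one concrete way the condition embedding can "act as an intermediary" in forming the latent tokens.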
  • Figure 3: ImageNet-1K reconstruction scaling and comparison. (a) Reconstruction FID ($\mathrm{rFID}$) of SMAP under different token budgets and model scales. Across VQ, KL, and SoftVQ variants, increasing the number of latent tokens consistently improves reconstruction quality, and larger SMAP models achieve stronger performance under the same token budget. (b) Reconstruction comparison with prior 1D tokenizers. At matched token lengths, SMAP consistently outperforms TiTok and TA-TiTok, with the largest gains observed in continuous latent settings. Overall, the results show that SMAP scales favorably with both token budget and model capacity, while providing substantially better reconstruction quality than existing baselines.
  • Figure 4: Semantic identity is controlled by $C$, while instance-level details are carried by $Z$. We visualize reconstructions obtained by independently manipulating the semantic condition $C$ and latent tokens $Z$. Using only $C$ with $Z=\emptyset$ yields coarse reconstructions that preserve category-level semantics. In contrast, cross-combining $C$ from one image with $Z$ from another transfers semantic identity and instance-specific appearance in a complementary manner.
  • Figure 5: Effect of semantic-aware tokenization on downstream generation. We compare three settings: a reconstruction-only tokenizer with independent generator conditioning, a semantic-aware tokenizer with independent generator conditioning, and a shared-semantic setting in which the generator reuses the tokenizer's learned semantic embedding space. Semantic-aware tokenizer pretraining consistently improves $\mathrm{gFID}$ across all token budgets, and semantic sharing yields a further gain in every setting.