Semantic Context Matters: Improving Conditioning for Autoregressive Models

Dongyang Jin, Ryan Xu, Jianhao Zeng, Rui Lan, Yancheng Bai, Lei Sun, Xiangxiang Chu

TL;DR

This work tackles weak conditioning in autoregressive image editing by introducing SCAR, a semantic-context-driven conditioning framework. SCAR replaces dense VQ prefixes with Compressed Semantic Prefilling and adds Semantic Alignment Guidance, which aligns the model's last visual hidden states with target semantics during decoding. In experiments on controllable generation and instruction editing, SCAR achieves state-of-the-art fidelity and semantic adherence among AR-based methods across next-token and next-set AR paradigms, while remaining efficient. The approach uses frozen Vision Foundation Models (e.g., DINOv2) as a source of semantic priors and is designed to integrate with unified multimodal systems. The authors state that code will be released.

Abstract

Recently, autoregressive (AR) models have shown strong potential in image generation, offering better scalability and easier integration with unified multi-modal systems compared to diffusion-based methods. However, extending AR models to general image editing remains challenging due to weak and inefficient conditioning, often leading to poor instruction adherence and visual artifacts. To address this, we propose SCAR, a Semantic-Context-driven method for Autoregressive models. SCAR introduces two key components: Compressed Semantic Prefilling, which encodes high-level semantics into a compact and efficient prefix, and Semantic Alignment Guidance, which aligns the last visual hidden states with target semantics during autoregressive decoding to enhance instruction fidelity. Unlike decoding-stage injection methods, SCAR builds upon the flexibility and generality of vector-quantized-based prefilling while overcoming its semantic limitations and high cost. It generalizes across both next-token and next-set AR paradigms with minimal architectural changes. SCAR achieves superior visual fidelity and semantic alignment on both instruction editing and controllable generation benchmarks, outperforming prior AR-based methods while maintaining controllability. All code will be released.
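To make the two components named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the compression operator or the alignment objective, so the sketch assumes 2×2 average pooling over a grid of patch features (which yields a 4× shorter prefix) and a cosine-similarity loss between hidden states and target features. All function names (`avg_pool_2x2`, `flatten_prefix`, `alignment_loss`) and the toy data are hypothetical.

```python
import math

def avg_pool_2x2(grid):
    """Average each non-overlapping 2x2 block of feature vectors in an H x W grid.

    Assumption: this stands in for the (unspecified) compression step; it cuts
    the number of prefix tokens by 4x.
    """
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    pooled = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            block = [grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1]]
            dim = len(block[0])
            row.append([sum(v[d] for v in block) / 4.0 for d in range(dim)])
        pooled.append(row)
    return pooled

def flatten_prefix(grid):
    """Flatten an H x W grid of feature vectors into a 1-D prefix sequence."""
    return [vec for row in grid for vec in row]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(hidden_states, target_features):
    """Mean (1 - cosine similarity) over paired vectors.

    Assumption: an illustrative alignment objective, not the paper's loss.
    """
    assert len(hidden_states) == len(target_features)
    sims = [cosine_similarity(h, t) for h, t in zip(hidden_states, target_features)]
    return 1.0 - sum(sims) / len(sims)

# Toy 4x4 grid of 3-dimensional "semantic" features (placeholder values).
grid = [[[float(i), float(j), 1.0] for j in range(4)] for i in range(4)]
prefix = flatten_prefix(avg_pool_2x2(grid))
print(len(flatten_prefix(grid)), "->", len(prefix), "prefix tokens")
```

In a real system the grid would come from a frozen vision encoder and the hidden states from the AR model, with a learned projection between the two feature spaces; those pieces are omitted here.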

Paper Structure

This paper contains 19 sections, 8 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Condition injection via (a) decoding-stage injection and (b) prefilling-stage conditioning. (c) Training cost under different visual token compression ratios: 4× compression reduces GPU memory usage by 23.9% (from 56.6 GB to 43.1 GB) and accelerates training by 1.42×. (d) Comparison between our method and VQ token prefilling.
  • Figure 2: Overview of our proposed SCAR, a prefilling-based method for autoregressive image editing. SCAR is composed of (a) Compressed Semantic Prefilling (see \ref{sec:scp} for details) and (c) Semantic Alignment Guidance (see \ref{sec:sra} for details), jointly enabling semantically guided generation. The framework is general and compatible with both next-token and next-set AR paradigms.
  • Figure 3: Visualization of C2I controllable generation. SCAR results are shown for models built on VAR and LlamaGen, respectively.
  • Figure 4: Visualization of multi-condition controllable SCAR-Uni (based on LlamaGen) under varying control conditions.
  • Figure 5: Visualization of T2I controllable generation. Our SCAR generates images with significantly higher visual quality.
  • ...and 6 more figures
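
The training-cost figures quoted in the Figure 1(c) caption can be checked with a few lines of arithmetic. The memory values and the 1.42× speedup come from the caption; reading the speedup as a throughput ratio (so that relative training time is its reciprocal) is an assumption, and the variable names are illustrative.

```python
# Values quoted in the Figure 1(c) caption.
mem_before_gb = 56.6   # GPU memory without compression
mem_after_gb = 43.1    # GPU memory with 4x visual token compression
speedup = 1.42         # reported training acceleration

# Relative memory saving.
mem_reduction = (mem_before_gb - mem_after_gb) / mem_before_gb

# Assumption: the speedup is a throughput ratio, so time scales as 1 / speedup.
time_fraction = 1.0 / speedup

print(f"memory reduction: {mem_reduction:.1%}")        # memory reduction: 23.9%
print(f"relative training time: {time_fraction:.1%}")  # relative training time: 70.4%
```

The computed 23.9% matches the caption; under the stated assumption, a 1.42× speedup corresponds to roughly 70% of the original training time.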