EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li

TL;DR

EchoGen tackles the efficiency bottleneck of subject-driven image generation by equipping a Visual Autoregressive (VAR) backbone with a dual-path injection that separately encodes semantic identity and fine-grained content. It combines a semantic encoder (DINOv2) and a content encoder (FLUX.1-dev VAE) through decoupled cross-attention and multi-modal attention, and incorporates a global semantic prefix via AdaLN, so that subject fidelity is preserved while text alignment stays high. A segmentation pipeline (Qwen2.5-VL + GroundingDINO) mitigates background noise, and a flexible subject-text classifier-free guidance scheme enables a tunable trade-off between subject fidelity and prompt adherence. Quantitatively, EchoGen matches or surpasses diffusion-based methods on DreamBench with substantially lower sampling latency, illustrating a practical, scalable path toward real-time subject-driven generation on VAR models.
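
To make the fidelity versus prompt-adherence trade-off concrete, here is a minimal sketch of classifier-free guidance with two separate scales. The summary above does not give EchoGen's exact formula, so the nested combination below is an assumption borrowed from common two-condition guidance practice, and the names `guided_logits`, `text_scale`, and `subject_scale` are hypothetical. It operates on next-token logits, as an autoregressive model would.

```python
import math

def guided_logits(uncond, text_only, text_and_subject, text_scale, subject_scale):
    """Combine three logit vectors using separate text and subject scales.

    ASSUMPTION: nested two-condition guidance, not the paper's confirmed rule.
    uncond:            logits with neither text nor subject condition
    text_only:         logits with the text prompt only
    text_and_subject:  logits with both text prompt and subject reference
    """
    return [
        u + text_scale * (t - u) + subject_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_subject)
    ]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of three tokens (values are illustrative only).
uncond = [0.0, 0.0, 0.0]
text_only = [1.0, 0.0, 0.0]
text_and_subject = [1.0, 2.0, 0.0]

# Raising subject_scale pushes probability toward tokens favored by the subject.
weak = softmax(guided_logits(uncond, text_only, text_and_subject, 1.0, 1.0))
strong = softmax(guided_logits(uncond, text_only, text_and_subject, 1.0, 2.0))
```

With both scales set to 1 the combination reduces to the fully conditioned logits, and with both set to 0 it reduces to the unconditional logits. Tuning the two scales independently is what would allow subject fidelity and prompt adherence to be traded against each other.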

Abstract

Subject-driven generation is a critical task in creative AI, yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
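
As an illustration of the decoupled cross-attention mentioned in the abstract, the sketch below attends to text tokens and subject (semantic) tokens through two separate attention calls and sums the results. This is a generic single-head version for one query vector, with learned projections omitted. The function names and the `subject_scale` weight are hypothetical, and the actual EchoGen block may differ.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def decoupled_cross_attention(query, text_kv, subject_kv, subject_scale=1.0):
    """Sum of two independent cross-attention branches.

    ASSUMPTION: the branches are combined additively with a scalar weight;
    text_kv and subject_kv are (keys, values) pairs with their own tokens.
    """
    text_out = attend(query, *text_kv)
    subject_out = attend(query, *subject_kv)
    return [t + subject_scale * s for t, s in zip(text_out, subject_out)]

# Toy example: one text token and one subject token, 2-dimensional features.
query = [1.0, 0.0]
text_kv = ([[1.0, 0.0]], [[2.0, 3.0]])
subject_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
out = decoupled_cross_attention(query, text_kv, subject_kv, subject_scale=0.5)
```

Because each branch has its own softmax, subject tokens do not compete with text tokens for attention weight, which is the usual motivation for keeping the two paths separate.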

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Feed-forward subject-driven generation by EchoGen. By employing a visual autoregressive paradigm, EchoGen achieves high-quality image synthesis at lower latency while preserving intricate subject identity.
  • Figure 2: Overview of the EchoGen architecture. The left panel illustrates the overall model framework with dual-path subject injection, while the right panel provides a detailed schematic of the EchoGen block with a carefully designed attention mask applied in the Multi-Modal Attention module to avoid feature leakage. $C$ denotes the global semantic token extracted from the semantic encoder, which is prepended to the input sequence. $S$ represents the start token for the first-scale generation. Adaptive Layer Normalization modules in the EchoGen blocks are omitted for clarity.
  • Figure 3: The pipeline of subject segmentation.
  • Figure 4: Qualitative comparison with diffusion-based methods on DreamBench (Ruiz et al., 2023). For a fair comparison, we adopt the default sampling settings for all baseline models.
  • Figure 5: Visualization of the effect of classifier-free guidance scale coefficient.
  • ...and 1 more figure