MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

TL;DR

MENTOR introduces an efficient autoregressive (AR) framework for multimodal image generation that directly aligns multimodal inputs with output image tokens through a two-stage training paradigm. The architecture unifies a frozen multimodal encoder with a transformer decoder, enabling fine-grained, token-level alignment without cross-attention adapters. On DreamBench++ it achieves a competitive CP·PF score (concept preservation times prompt following) with substantially less training data and compute, demonstrating strong controllability and image fidelity while remaining versatile across multimodal tasks. Despite relying on weaker backbones, MENTOR shows that AR architectures combined with staged training can deliver practical, scalable, resource-efficient multimodal generation.

Abstract

Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite its modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
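
To make "token-level alignment" concrete, the following is a minimal sketch of a next-token objective in which multimodal condition tokens act only as context and the loss is computed on the output image tokens. It is not taken from the MENTOR codebase: the function names, the toy vocabulary, and the use of plain cross-entropy are all assumptions for illustration, and fixed logits stand in for a real decoder.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_next_token_loss(logits_per_position, token_ids, is_target):
    """Mean cross-entropy over positions whose next token is a target token.

    logits_per_position[t] holds the (hypothetical) decoder scores for
    token t + 1 given tokens 0..t. Condition tokens (is_target False)
    are used as context only and contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        if is_target[t + 1]:
            total -= log_softmax(logits_per_position[t])[token_ids[t + 1]]
            count += 1
    return total / count if count else 0.0


# Toy sequence: three condition tokens followed by two output image tokens.
vocab_size = 4
token_ids = [0, 1, 2, 3, 1]
is_target = [False, False, False, True, True]

# An untrained stand-in model that scores every token equally.
uniform_logits = [[0.0] * vocab_size for _ in token_ids]
loss = masked_next_token_loss(uniform_logits, token_ids, is_target)
print(round(loss, 4))
```

With uniform scores the loss equals log(4), about 1.3863, the value expected from a model that has learned nothing about the targets. In this sketch, the two training stages described above would differ in the data that fills the condition and target positions, not in the form of the objective; that reading is an assumption, not a statement from the abstract.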

Paper Structure

This paper contains 46 sections, 3 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 3: Overview of MENTOR. The left panel illustrates the model structure, in which visual and textual inputs are encoded into a unified latent representation that guides autoregressive image generation. The right panel highlights the two-stage training paradigm: (1) Multimodal Alignment Tuning, which enables pixel- and semantic-level alignment between inputs and output tokens; and (2) Multimodal Instruction Tuning, which compels the model to balance the influence of different modalities effectively.
  • Figure 4: Qualitative study on image reconstruction.
  • Figure 5: Overview of text-guided visual distillation using the query-based variant of MENTOR.
  • Figure 6: Qualitative examples of different methods compared to MENTOR on DreamBench++.
  • Figure 7: Qualitative assessment demonstrating improved preservation of visual details by MENTOR following multi-image training.
  • ...and 8 more figures