Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation
Yi Wu, Shengju Qian, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
TL;DR
The paper tackles subject-driven image generation with multimodal autoregressive (AR) models, which lag behind diffusion models on this task. It introduces Proxy-Tuning, a three-stage method in which a fine-tuned diffusion model generates proxy data for training an AR model. The result is an unexpected weak-to-strong generalization: the AR models outperform their diffusion supervisors in both subject fidelity and prompt adherence. The approach is robust across supervisor architectures and supports efficient multi-subject personalization, which suggests that AR architectures can integrate and extend learned subject features. The work points toward cross-architecture knowledge transfer and scalable subject-specific generation, and outlines theoretical and automation-oriented directions for future work.
Abstract
Multimodal autoregressive (AR) models, based on next-token prediction and the transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks, including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, which leverages diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of the strengths and limitations of different architectures.
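The data flow described above can be sketched schematically. The summary does not enumerate the three stages, so the split used here (personalize a diffusion supervisor, sample proxy data from it, fine-tune the AR model on that data) is an assumed reading, not the authors' stated procedure. All names (`proxy_tuning`, `ProxySample`, the `toy_*` stand-ins) are hypothetical, and the toy functions return strings instead of training real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# A "generator" here is anything that maps a text prompt to an image-like object.
Generator = Callable[[str], Any]


@dataclass
class ProxySample:
    """One (prompt, image) pair produced by the diffusion supervisor."""
    prompt: str
    image: Any


def proxy_tuning(
    reference_images: Sequence[Any],
    prompts: Sequence[str],
    finetune_supervisor: Callable[[Sequence[Any]], Generator],
    finetune_student: Callable[[List[ProxySample]], Generator],
) -> Tuple[Generator, List[ProxySample]]:
    # Assumed stage 1: personalize the diffusion supervisor on the subject's reference images.
    supervisor = finetune_supervisor(reference_images)
    # Assumed stage 2: sample proxy data by running the supervisor over varied prompts.
    proxy_data = [ProxySample(prompt=p, image=supervisor(p)) for p in prompts]
    # Assumed stage 3: fine-tune the AR model on the proxy data only.
    student = finetune_student(proxy_data)
    return student, proxy_data


# Toy stand-ins (hypothetical): they only record what they were given.
def toy_finetune_supervisor(reference_images: Sequence[Any]) -> Generator:
    n_refs = len(reference_images)
    return lambda prompt: f"diffusion_image[{prompt}|refs={n_refs}]"


def toy_finetune_student(proxy_data: List[ProxySample]) -> Generator:
    n_proxy = len(proxy_data)
    return lambda prompt: f"ar_image[{prompt}|proxy={n_proxy}]"


student, proxy_data = proxy_tuning(
    reference_images=["ref_0.png", "ref_1.png", "ref_2.png"],
    prompts=["a <subject> dog on a beach", "a <subject> dog in the snow"],
    finetune_supervisor=toy_finetune_supervisor,
    finetune_student=toy_finetune_student,
)
```

In this sketch the AR student never sees the original reference images, only the supervisor's outputs. That is the setting in which the reported weak-to-strong effect is notable: the student is said to surpass the model that produced its training data.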
