Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

Yi Wu, Shengju Qian, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li

TL;DR

The paper tackles subject-driven image generation with multimodal autoregressive models, which lag behind diffusion models in this niche. It introduces Proxy-Tuning, a three-stage method that uses a fine-tuned diffusion model to generate proxy data for training AR models, producing an unexpected weak-to-strong generalization where AR models outperform their diffusion supervisors in subject fidelity and prompt adherence. The approach proves robust across supervisor architectures and enables efficient multi-subject personalization, highlighting AR architectures' capacity to integrate and extend learned subject features. The work opens new avenues for cross-architecture knowledge transfer and scalable subject-specific generation, while outlining theoretical and automation-oriented directions for future work.

Abstract

Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.
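As a rough illustration of the pipeline summarized above, the sketch below chains three steps: subject-tune a diffusion supervisor, let it synthesize proxy data, then fine-tune the AR model on that data. The stage breakdown is one reading of the summary, not the authors' code. All names here (`proxy_tuning`, `ProxyExample`, `tune_diffusion`, `tune_ar`, `samples_per_prompt`) are hypothetical, and the two tuning routines are passed in as callables because the source does not specify their implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical aliases: an image is treated as an opaque object, and a
# "generator" is anything that maps a text prompt to an image.
Image = Any
Generator = Callable[[str], Image]


@dataclass
class ProxyExample:
    """One (prompt, image) pair synthesized by the diffusion supervisor."""
    prompt: str
    image: Image


def proxy_tuning(
    subject_images: Sequence[Image],
    prompts: Sequence[str],
    tune_diffusion: Callable[[Sequence[Image]], Generator],
    tune_ar: Callable[[Sequence[ProxyExample]], Generator],
    samples_per_prompt: int = 1,
) -> Tuple[Generator, List[ProxyExample]]:
    """Sketch of the pipeline; the tuning callables are assumptions."""
    # Stage 1: subject-tune the diffusion model on the few reference images
    # (the source names DreamBooth for this step).
    supervisor = tune_diffusion(subject_images)

    # Stage 2: use the tuned supervisor to synthesize a proxy dataset.
    proxy_data = [
        ProxyExample(prompt=p, image=supervisor(p))
        for p in prompts
        for _ in range(samples_per_prompt)
    ]

    # Stage 3: fine-tune the AR model on the proxy data instead of on the
    # original handful of subject images.
    student = tune_ar(proxy_data)
    return student, proxy_data
```

Passing the tuning routines as arguments keeps the sketch independent of any particular diffusion or AR library; in practice each callable would wrap a full training loop.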

Paper Structure

This paper contains 14 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visualizations of parameter-efficient (LoRA) and end-to-end direct subject fine-tuning on an AR model.
  • Figure 2: Framework of Proxy-Tuning. Given a limited number of images of a particular target, we first conduct subject-tuning (i.e., DreamBooth) on the diffusion model. The diffusion model then serves as a supervisor for fine-tuning the AR model. Direct subject-tuning of the AR model captures the target appearance poorly, and adding data augmentation to subject-tuning reduces semantic editability (as discussed in the ablation section). In contrast, Proxy-Tuning captures the target appearance while retaining excellent semantic editability.
  • Figure 3: Visualization of our Proxy-Tuning method. We use SDXL and SD3, respectively, as weak supervisors to fine-tune Lumina-mGPT. The subject appearance (first row) learned by the fine-tuned Lumina-mGPT (third and fifth rows) is even better than that learned by its weak supervisors (second and fourth rows).
  • Figure 4: Visualization of multi-subject personalization. We fine-tune SD3 and SD3.5 to learn multiple subjects simultaneously, and apply Proxy-Tuning to Lumina-mGPT under the supervision of SDXL to learn multiple subjects simultaneously.
  • Figure 5: Qualitative comparison of Proxy-Tuning, fine-tuning with data augmentation, and fine-tuning with the original dataset. The AR model is LlamaGen and the diffusion supervisor in Proxy-Tuning is SDXL.
  • ...and 1 more figure