Personalized Text-to-Image Generation with Auto-Regressive Models

Kaiyue Sun, Xian Liu, Yao Teng, Xihui Liu

TL;DR

This work investigates personalized image generation with auto-regressive models by introducing a two-stage training protocol that first optimizes a subject-specific text embedding and then fine-tunes the transformer layers. Using the Lumina-mGPT 7B model, the approach achieves subject fidelity and prompt following competitive with diffusion-based personalization methods, addressing a gap in applying unified auto-regressive architectures to personalization. The results demonstrate the viability of auto-regressive personalized generation for re-contextualization, accessorization, and property modification, while highlighting practical limitations in speed and the need for responsible use. Overall, the paper offers a new direction for personalized text-to-image synthesis and suggests avenues for improving efficiency and safety in auto-regressive multimodal models.

Abstract

Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.
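
The two-stage strategy described above can be pictured as a schedule of which parameter groups receive gradient updates. The following is a minimal sketch, not the authors' implementation: the `Param` class, the `configure_stage` function, and the parameter names are hypothetical, and it assumes (from the word "additionally" in the Figure 2 caption) that the identifier embedding keeps training during the second stage.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a named model parameter group."""
    name: str
    trainable: bool = False


def configure_stage(params, stage):
    """Mark which parameter groups are updated in the given stage.

    Stage 1: only the text embedding of the identifier token [V].
    Stage 2: the identifier embedding plus the transformer layers
             (assumption: the embedding is not frozen again).
    Returns the names of the trainable groups, in input order.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    for p in params:
        is_identifier_embedding = p.name == "text_embedding.[V]"
        is_transformer_layer = p.name.startswith("transformer.")
        if stage == 1:
            p.trainable = is_identifier_embedding
        else:
            p.trainable = is_identifier_embedding or is_transformer_layer
    return [p.name for p in params if p.trainable]


# Illustrative parameter groups; the names are made up for this sketch.
params = [
    Param("text_embedding.[V]"),
    Param("text_embedding.other_tokens"),
    Param("transformer.layer_0"),
    Param("transformer.layer_1"),
]

print(configure_stage(params, 1))
print(configure_stage(params, 2))
```

In a real training loop, the same idea would typically be expressed by toggling gradient tracking on the corresponding tensors; the embeddings of all other vocabulary tokens stay frozen in both stages under this reading.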

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview. Using only a few reference images (typically 3-5) of a subject (left), we fine-tune an auto-regressive model to generate personalized images of the subject in diverse contexts (right), guided by text prompts.
  • Figure 2: Overview of Fine-tuning. We fine-tune a text-to-image auto-regressive model using 3-5 input images, each paired with a text prompt that includes a unique identifier and the subject class name (e.g., "A photo of [V] dog"). The process involves two stages: first, we fine-tune the text embedding for the identifier [V], and second, we additionally fine-tune the transformer layers to enhance the model's performance.
  • Figure 3: Qualitative results. We generate images of personalized objects to showcase the generative capabilities of re-contextualization and property modification.
  • Figure 4: Qualitative results. We generate images of personalized animals to showcase the generative capabilities of re-contextualization and accessorization.
  • Figure 5: Preservation of class semantic priors. Fine-tuning auto-regressive models with a set of reference images does not result in language drift or reduced output diversity. The first column displays the training images; the next three columns show images generated using free-form prompts that include the specific subject class name.
  • ...and 2 more figures
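
The Figure 2 caption describes pairing each reference image with a prompt that contains a unique identifier and the subject class name. A minimal sketch of that pairing follows; the function names and the image file names are hypothetical, and the template simply mirrors the caption's example "A photo of [V] dog".

```python
def build_prompt(class_name, identifier="[V]"):
    """Build a training prompt in the form given in the Figure 2 caption."""
    return f"A photo of {identifier} {class_name}"


def build_training_pairs(image_paths, class_name, identifier="[V]"):
    """Pair every reference image with the same identifier prompt."""
    prompt = build_prompt(class_name, identifier)
    return [(path, prompt) for path in image_paths]


# Hypothetical file names; the captions mention typically 3-5 reference images.
pairs = build_training_pairs(
    ["dog_01.jpg", "dog_02.jpg", "dog_03.jpg"], class_name="dog"
)
for path, prompt in pairs:
    print(path, "->", prompt)
```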