Imagine yourself: Tuning-Free Personalized Image Generation

Zecheng He, Bo Sun, Felix Juefei-Xu, Haoyu Ma, Ankit Ramchandani, Vincent Cheung, Siddharth Shah, Anmol Kalia, Harihar Subramanyam, Alireza Zareian, Li Chen, Ankit Jain, Ning Zhang, Peizhao Zhang, Roshan Sumbaly, Peter Vajda, Animesh Sinha

TL;DR

This work tackles scalable, high-quality personalized image generation without subject-specific tuning by introducing Emu-Personalization. It combines SynPairs synthetic paired data, a fully parallel image-text fusion architecture with three text encoders and a trainable vision encoder, and a coarse-to-fine multi-stage finetuning regime, all augmented by LoRA to preserve foundation-model quality. Across extensive human evaluations, Emu-Personalization achieves state-of-the-art identity preservation, text alignment, and visual appeal, outperforming tuning-based and tuning-free baselines. The approach further extends to multi-subject personalization, enabling simultaneous identity control and prompt-driven editing with improved diversity and realism.

Abstract

Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Previous work struggled to balance identity preservation, adherence to complex prompts, and visual quality, resulting in models with a strong copy-paste effect from the reference images. As a result, they can hardly generate images for prompts that require significant changes to the reference image, e.g., changing facial expression or head and body pose, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model's SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to previous personalization models.

Paper Structure

This paper contains 27 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Generated results for the four reference images (depicted below) using Emu-Personalization. A single reference image per subject is used to generate that subject in novel poses and styles.
  • Figure 2: Overview of the Emu-Personalization model architecture. We introduce a fully parallel architecture that incorporates three text encoders and a trainable vision encoder to optimize identity preservation and text alignment. We apply LoRA to the self-attention layers and the text cross-attention layers to best preserve the foundation model's image generation quality.
  • Figure 3: Generation pipeline for SynPairs data. We first caption real images using a multi-modal LLM and rewrite the captions with an LLM rewriter. The rewritten prompt is fed into a text-to-image generation model to obtain high-quality synthetic images, which are then refined with the reference image to better preserve identity. This results in high-quality paired data, i.e., the same identity with varying expression, pose, lighting conditions, etc.
  • Figure 4: Fully parallel image-text fusion architecture. We employ three distinct text encoders as the text conditioning: the CLIP ViT-L text encoder (Radford et al., 2021), UL2 (Raffel et al., 2020), and ByT5 (Xue et al., 2022). They interact with a trainable CLIP vision encoder through fully parallel attention fusion.
  • Figure 5: Training with real images yields higher identity preservation, while training with synthetic images yields higher prompt alignment. After interleaved multi-stage training, identity and prompt alignment achieve the best trade-off.
  • ...and 7 more figures
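
The captions for Figures 2 and 4 describe a parallel image-text fusion with LoRA on the text cross-attention layers. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes single-head attention, that the two branches are fused by summation onto a residual, that the outputs of the three text encoders have already been merged into one token sequence, and that LoRA is applied only to the text query projection. All function and parameter names (`cross_attention`, `lora_weight`, `parallel_fusion`, `make_params`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    # Single-head cross-attention: query tokens attend over context tokens.
    q = query @ w_q                      # (n_query, d)
    k = context @ w_k                    # (n_context, d)
    v = context @ w_v                    # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v  # (n_query, d)


def lora_weight(w_frozen, lora_a, lora_b, scale=1.0):
    # Frozen weight plus a low-rank update A @ B (the LoRA idea).
    return w_frozen + scale * (lora_a @ lora_b)


def parallel_fusion(latents, text_tokens, vision_tokens, p):
    # Both branches read the same latents and run independently.
    # Summing them is an assumption made for this sketch.
    w_q_text = lora_weight(p["text_q"], p["lora_a"], p["lora_b"])
    text_out = cross_attention(latents, text_tokens, w_q_text, p["text_k"], p["text_v"])
    vision_out = cross_attention(latents, vision_tokens, p["vision_q"], p["vision_k"], p["vision_v"])
    return latents + text_out + vision_out


def make_params(rng, d_model, d_text, d_vision, rank):
    # Random projections for illustration only; lora_b starts at zero so the
    # LoRA-augmented weight initially equals the frozen weight.
    return {
        "text_q": rng.normal(size=(d_model, d_model)),
        "text_k": rng.normal(size=(d_text, d_model)),
        "text_v": rng.normal(size=(d_text, d_model)),
        "vision_q": rng.normal(size=(d_model, d_model)),
        "vision_k": rng.normal(size=(d_vision, d_model)),
        "vision_v": rng.normal(size=(d_vision, d_model)),
        "lora_a": rng.normal(size=(d_model, rank)),
        "lora_b": np.zeros((rank, d_model)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = make_params(rng, d_model=8, d_text=6, d_vision=5, rank=2)
    latents = rng.normal(size=(4, 8))        # 4 image latent tokens
    text_tokens = rng.normal(size=(7, 6))    # 7 text tokens
    vision_tokens = rng.normal(size=(3, 5))  # 3 reference-image tokens
    fused = parallel_fusion(latents, text_tokens, vision_tokens, params)
    print(fused.shape)
```

Because the low-rank factor `lora_b` starts at zero, the text branch initially behaves exactly like the frozen projection, which is one common way a LoRA-style update can leave a pretrained model's behavior unchanged at the start of finetuning.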