UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang

TL;DR

UniVG addresses the fragmentation of diffusion models by proposing a single generalist diffusion model that supports T2I generation, editing, and related tasks with one set of weights. It uses a minimalist MM-DiT-based latent-diffusion backbone and a flow-matching objective, with latent noise, VAE latent, and mask concatenated along the channel dimension, and it enables external control through embedding replacement. The training pipeline is three-stage: foundation T2I pretraining, multi-task expansion, and ID-preserving finetuning on a diverse data mix, yielding strong results across tasks. The model achieves a GenEval score of $0.70$ and outperforms task-specific and unified baselines on several benchmarks, while maintaining inference efficiency by keeping a fixed sequence length.
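
To make the flow-matching objective concrete, here is a minimal sketch of one training-loss evaluation. It assumes a linear (rectified-flow style) path between data and noise, which the summary does not confirm; the names `interpolate`, `target_velocity`, `flow_matching_loss`, and `predict` are hypothetical, and flat Python lists stand in for latent tensors.

```python
import random

def interpolate(x0, noise, t):
    """Point on the straight path from data (t=0) to noise (t=1). Assumed path."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]

def target_velocity(x0, noise):
    """Velocity of the straight path; constant in t for a linear path."""
    return [b - a for a, b in zip(x0, noise)]

def flow_matching_loss(predict, x0, cond, rng):
    """Mean squared error between predicted and target velocity.

    predict: hypothetical model, called as predict(model_input, t),
             returning one velocity value per element of x0.
    cond:    conditioning values (standing in for the image latent and mask).
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = interpolate(x0, noise, t)
    # List concatenation stands in for channel-wise concatenation of
    # the noisy latent with the image latent and mask.
    model_input = x_t + cond
    v_pred = predict(model_input, t)
    v_true = target_velocity(x0, noise)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x0)

# Usage with a trivial predictor that always outputs zero velocity.
rng = random.Random(0)
loss = flow_matching_loss(lambda x, t: [0.0, 0.0], [1.0, 2.0], [0.5], rng)
```

A real model would replace the lambda with the MM-DiT forward pass, and the loss would be averaged over a batch of sampled timesteps.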

Abstract

Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Paper Structure

This paper contains 11 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: We introduce UniVG, a single generalist model that can support diverse image generation tasks, including text-to-image, inpainting, identity-preserving generation, layout-guided generation, instruction-based editing, depth estimation, and referring segmentation.
  • Figure 2: An overview of our UniVG. UniVG contains a text encoder to extract prompt embeddings from the input text and an MM-DiT to perform cross-modal fusion for latent diffusion, where all visual guidance inputs (latent noise, input image, and input mask) are concatenated along the channel dimension as a fixed-length sequence for high efficiency. Additionally, an external condition can be injected through embedding replacement for further control. Hence, a generalist UniVG can support diverse tasks, such as text-to-image, in/outpainting, instruction-based editing, layout-guided generation, and ID-preserving generation. We also consider auxiliary tasks, including depth estimation, pose estimation, and referring segmentation, to enhance its visual scene perception.
  • Figure 3: Qualitative examples of text-to-image generation. Note that we simplify the prompt for better presentation.
  • Figure 4: Results of ID-preserving generation on Unsplash-50 [gal2024unsplash-50].
  • Figure 5: Qualitative comparisons of instruction-based editing.
  • ...and 4 more figures
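
The efficiency argument in the Figure 2 caption (channel-wise concatenation keeps the sequence length fixed) can be illustrated with a small arithmetic sketch. The latent size (16 channels at 64×64), the patch size of 2, and the single-channel mask are illustrative assumptions, not values taken from the paper; all function names are hypothetical.

```python
def num_tokens(latent_h, latent_w, patch=2):
    """Tokens produced by patchifying one latent grid (assumed patch size)."""
    return (latent_h // patch) * (latent_w // patch)

def channel_concat_shape(latent_c, latent_h, latent_w, mask_c=1):
    """Shape after stacking noise, image latent, and mask on the channel axis.

    Only the channel count grows; the spatial grid, and therefore the
    token count, is unchanged.
    """
    return (2 * latent_c + mask_c, latent_h, latent_w)

def sequence_concat_tokens(n_inputs, latent_h, latent_w, patch=2):
    """Token count if each input were instead appended as extra tokens."""
    return n_inputs * num_tokens(latent_h, latent_w, patch)

# Assumed example: 16-channel latent at 64x64 resolution.
shape = channel_concat_shape(16, 64, 64)           # channel axis grows to 33
fixed_len = num_tokens(64, 64)                     # 1024 tokens either way
grown_len = sequence_concat_tokens(3, 64, 64)      # 3072 tokens
attention_cost_ratio = (grown_len / fixed_len) ** 2
```

Because self-attention cost grows roughly quadratically with sequence length, tripling the token count in this assumed setting would cost about nine times more attention compute than the channel-wise layout.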