UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

Wei Li, Xue Xu, Jiachen Liu, Xinyan Xiao

TL;DR

UNIMO-G unifies text-driven and subject-driven image generation in a single diffusion framework, so that images can be generated faithfully from complex multimodal prompts. It pairs a Multimodal Large Language Model (MLLM) encoder with a conditional denoising diffusion network and is trained in two stages: text-to-image pre-training on large-scale Chinese data, followed by multimodal instruction tuning on interleaved text and image inputs. A data pipeline based on language grounding and image segmentation, together with a visual-enhanced learning objective, strengthens the alignment between input visuals and generated content. Across MS-COCO, DreamBench, and MultiBench, UNIMO-G reports superior performance on both single- and multi-entity subject-driven tasks, producing high-fidelity images that follow complex multimodal prompts and instructions.

Abstract

Existing text-to-image diffusion models primarily generate images from text prompts. However, the inherent conciseness of textual descriptions poses challenges in faithfully synthesizing images with intricate details, such as specific entities or scenes. This paper presents UNIMO-G, a simple multimodal conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs and demonstrates a unified ability for both text-driven and subject-driven image generation. UNIMO-G comprises two core components: a Multimodal Large Language Model (MLLM) for encoding multimodal prompts, and a conditional denoising diffusion network for generating images based on the encoded multimodal input. We leverage a two-stage training strategy to effectively train the framework: first pre-training on large-scale text-image pairs to develop conditional image generation capabilities, and then instruction tuning with multimodal prompts to achieve unified image generation proficiency. A well-designed data processing pipeline involving language grounding and image segmentation is employed to construct multimodal prompts. UNIMO-G excels in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective in generating high-fidelity images from complex multimodal prompts involving multiple image entities.
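
The abstract describes a conditional denoising diffusion network driven by encoded multimodal prompts, but this excerpt does not show the paper's training objective. As a rough illustration only, the sketch below uses the standard DDPM-style noise-prediction loss, which is an assumption here and not necessarily the authors' exact formulation. The names `alpha_bar`, `add_noise`, `denoising_loss`, and `zero_predictor`, the linear beta schedule, and the toy vectors are all hypothetical stand-ins; `cond` plays the role of the MLLM-encoded prompt features.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t (assumed linear beta schedule)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, eps, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

def denoising_loss(predict_noise, x0, cond, t, rng):
    """Mean squared error between the sampled noise and the denoiser's prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, t)
    eps_hat = predict_noise(x_t, t, cond)  # denoiser sees the noisy input and the condition
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat)) / len(x0)

def zero_predictor(x_t, t, cond):
    """Toy stand-in for the denoising network: ignores cond and predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # toy "image latent"
cond = [0.1, 0.2]             # toy stand-in for MLLM-encoded prompt features
loss = denoising_loss(zero_predictor, x0, cond, t=500, rng=rng)
print(f"toy denoising loss: {loss:.4f}")
```

In a real system `predict_noise` would be a trained network that attends to the prompt features; the toy predictor exists only so the loss computation can be run end to end.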

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Examples of UNIMO-G for both text-driven and zero-shot subject-driven generation. UNIMO-G can perceive free-form interleaved visual-language inputs and faithfully generate images. Particularly, it can generate images from multi-modal prompts with multiple image entities.
  • Figure 2: UNIMO-G consists of an MLLM for multimodal perception, and a conditional denoising UNet for image generation. It accepts multimodal prompts with interleaved images and texts, and generates images consistent with the image entities. Orange denotes the trainable modules; Blue denotes the frozen ones.
  • Figure 3: Overview of our data construction pipeline for multi-modal instruction tuning.
  • Figure 4: Comparison of UNIMO-G and SDXL by human evaluation. The mean and standard deviation are shown in the figure.
  • Figure 5: Comparison of UNIMO-G and KOSMOS-G on MultiBench by human evaluation. The mean and standard deviation are shown in the figure.
  • ...and 6 more figures