EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Xiangyu Zhao; Bo Liu; Qijiong Liu; Guangyuan Shi; Xiao-Ming Wu

TL;DR

Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation.

Abstract

We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need large amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
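
To make the idea of a projection layer concrete, the following is a minimal shape-level sketch, not the authors' implementation. It assumes the projection is a single linear map from a diffusion-side feature dimension to the LLM's embedding dimension, and that projected image tokens are simply prepended to the text embeddings. The function names (`make_linear`, `project`, `build_llm_input`) and all dimensions are hypothetical and chosen only for illustration.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear map (hypothetical stand-in
    for a trainable projection layer)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(tokens, weight, bias):
    """Map each feature vector of length in_dim to a vector of length out_dim."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b
         for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def build_llm_input(image_tokens, text_embeddings, weight, bias):
    """Assumption: projected image tokens are prepended to the text embeddings
    so the LLM sees one sequence in a single embedding space."""
    return project(image_tokens, weight, bias) + text_embeddings


if __name__ == "__main__":
    # Toy sizes: 2 image tokens of width 4, LLM embedding width 6.
    weight, bias = make_linear(in_dim=4, out_dim=6)
    image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    text_embeddings = [[0.0] * 6]
    sequence = build_llm_input(image_tokens, text_embeddings, weight, bias)
    print(len(sequence), len(sequence[0]))
```

In this sketch only `weight` and `bias` would be trained, which is one way to read the abstract's claim of data-efficient alignment; the actual layer design and training objective are described in the paper itself.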

Paper Structure (32 sections, 14 equations, 13 figures, 14 tables)

Introduction
Related Work
Basics of Diffusion Models
Conditional Generation
Proposed Model: EasyGen
Pre-training BiDiffuser: A Bidirectional Conditional Diffusion Model
Pre-training an Adapter to Enhance BiDiffuser's SUR Capability
Image-to-Text Generation
Aligning BiDiffuser with LLMs
Instruction-Tuning LLMs
Text-to-Image Response Generation
Experiments
Experimental Setup
Evaluation
Overall Results
...and 17 more sections

Figures (13)

Figure 1: Our model EasyGen can understand multimodal inputs and generate multimodal responses, as illustrated by model-generated speech bubbles in grey color, which include both text and images.
Figure 2: Overview of EasyGen.
Figure 3: The training of BiDiffuser involves finetuning the denoising transformer U-ViT in UniDiffuser with a joint objective of image-to-text and text-to-image tasks.
Figure 4: Two different ways of aligning BiDiffuser with LLMs.
Figure 5: Text-to-image generation by EasyGen. LLM generates the response and description of the image. BiDiffuser generates images based on the description.
...and 8 more figures
