TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Junzhe Xu, Yuyang Yin, Xi Chen

TL;DR

TBAC-UniImage unifies multimodal understanding and generation without expensive joint training by pairing a frozen MLLM with a DiT through Ladder-Side Diffusion Tuning. Learnable queries carry representations from multiple MLLM layers into the diffusion generator as hierarchical conditioning, forming a ladder that bridges understanding and generation. The approach yields competitive results on GenEval, DPG-Bench, TIIF-Bench, and ImgEdit, demonstrating strong instruction-following and editing capabilities among open-source models. The authors also identify limitations in handling dense prompts, maintaining editing consistency, and rendering in-image text, which point to directions for future work.

Abstract

This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.
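
The conditioning idea described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the layer stack, the sizes, the choice of tapped layers (`TAP_LAYERS`), and the one-tap-per-block wiring are all hypothetical stand-ins, chosen only to show how query states read from several depths of a frozen model could each condition a generator block, rather than only the final hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, N_QUERIES = 16, 6, 4   # toy sizes (assumed, not from the paper)
TAP_LAYERS = [1, 3, 5]              # hypothetical layers whose states are read out

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Frozen toy "MLLM": fixed weights that are never updated (stand-in for real layers).
frozen_weights = [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(N_LAYERS)]
# Learnable queries: in real training these would receive gradients.
queries = rng.normal(size=(N_QUERIES, D))

def mllm_query_states(prompt_tokens, queries):
    """Append queries to the prompt, run the frozen layers, and return
    the query-position states at each tapped layer."""
    h = np.concatenate([prompt_tokens, queries], axis=0)
    tapped = []
    for i, w in enumerate(frozen_weights):
        mix = softmax(h @ h.T / np.sqrt(D))   # crude token mixing
        h = h + np.tanh(mix @ h @ w)          # residual update
        if i in TAP_LAYERS:
            tapped.append(h[-N_QUERIES:])
    return tapped

def generator_block(latent, cond):
    """Toy generator block: image latents cross-attend to one tapped state."""
    attn = softmax(latent @ cond.T / np.sqrt(D))
    return latent + attn @ cond

def generate_step(latent, conds):
    # One generator block per tapped layer: the "rungs" of the ladder.
    for cond in conds:
        latent = generator_block(latent, cond)
    return latent

prompt = rng.normal(size=(5, D))     # 5 prompt tokens
latent = rng.normal(size=(8, D))     # 8 noisy image-latent tokens
conds = mllm_query_states(prompt, queries)
out = generate_step(latent, conds)
print(len(conds), out.shape)
```

Passing only `conds[-1:]` to `generate_step` would correspond to the final-hidden-state conditioning that the abstract describes as a shallow connection.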

Paper Structure

This paper contains 6 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Text-to-image generation results of TBAC-UniImage.
  • Figure 2: Overview of TBAC-UniImage. The MLLM parameters are frozen, while the learnable queries and the DiT are tuned together under the denoising objective.
  • Figure 3: Text-to-image samples generated by TBAC-UniImage.
  • Figure 4: Image editing samples generated by TBAC-UniImage.