ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

Jiangtong Tan, Lin Liu, Jie Huang, Xiaopeng Zhang, Qi Tian, Feng Zhao

TL;DR

ParaUni tackles the challenge of integrating hierarchical representations from vision-language systems into diffusion-based generation within unified multimodal frameworks. It introduces a Layer Integration Module to fuse features across all VLM layers in parallel and a Layer-wise Dynamic Adjustment Mechanism to guide RL with layer-specific perturbations. Empirical results on GenEval and DPG-Bench show state-of-the-art generation quality and robust multi-reward RL improvements, with extensive ablations confirming the value of multi-layer conditioning and the LDAM design. The approach demonstrates improved fidelity and semantic alignment, with a scalable, modular architecture that can leverage multiple reward signals during RL and adapt to various prompts.

Abstract

Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction with flexible implementation because the representations of the two components differ greatly. Considering the abundant, hierarchical information in a VLM's layers, which ranges from low-level details to high-level semantics, we propose ParaUni. It extracts features from the various VLM layers in a parallel (Para) way for comprehensive information interaction and retains a flexible, separated architecture to enhance generation in a unified (Uni) multimodal model. Concretely, visual features from all VLM layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Building on this observation, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that uses RL to improve multiple rewards by aligning with the hierarchical properties of these layers. Extensive experiments show that ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for advancing multiple rewards during the RL stage. Code is available at https://github.com/JosephTiTan/ParaUni.
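
To make the parallel fusion idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the per-layer features are stacked into one array, concatenated along the token axis, mixed by a single self-attention step with a residual connection, and then layer-normalized. The function name `fuse_layers`, the random weights, and every size except the 28 layers mentioned in the Figure 1 caption are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_layers(layer_feats, w_q, w_k, w_v):
    """Hypothetical stand-in for a layer integration step.

    layer_feats: (num_layers, num_queries, dim), one slice per VLM layer.
    Returns a (num_layers * num_queries, dim) conditioning sequence.
    """
    num_layers, num_queries, dim = layer_feats.shape
    # Assumption: tokens from all layers are placed side by side in one sequence.
    tokens = layer_feats.reshape(num_layers * num_queries, dim)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every token attends to tokens from every layer, shallow and deep alike.
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    mixed = tokens + attn @ v  # residual connection
    return layer_norm(mixed)

rng = np.random.default_rng(0)
num_layers, num_queries, dim = 28, 4, 16  # only the layer count comes from the text
feats = rng.normal(size=(num_layers, num_queries, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
cond = fuse_layers(feats, w_q, w_k, w_v)
print(cond.shape)  # (112, 16)
```

In a real system the output `cond` would play the role of the condition passed to the diffusion model; a trained Transformer block would replace the random projections used here.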

Paper Structure

This paper contains 15 sections, 6 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of how the CLIP score changes across the 28 layer features of the VLM, from shallow to deep. We conduct experiments on 500 prompts and find that as the layer depth increases, the images generated by the diffusion model gradually shift from focusing on detailed textures to enhancing semantic information. The CLIP score, which measures the alignment between images and text, also increases with layer depth. This indicates that different depths of the VLM's transformer layers encode information ranging from low-level details to high-level semantics.
  • Figure 2: Comparison between (a) using only the last layer of the VLM and (b) using all layers. With all layers, the generated images contain more details, indicating that the diffusion model integrates more detailed information from the VLM. This supports using all layers as conditions.
  • Figure 3: Layer similarity under last-layer versus all-layer interaction. With last-layer interaction, the similarity between layers is relatively low, whereas with all-layer interaction, the layers form cluster-like groups. According to li2025unifusion, this phenomenon is general and not limited to our base model. We analyze the properties of each clustered region in Figure 4.
  • Figure 4: Comparison of how strongly each reward score responds to changes in different layer regions. The CLIP score is sensitive to changes in all three regions but is affected most by the deep layers. The aesthetic score and PickScore are most sensitive to changes in the middle layers and are little affected by changes in the shallow layers. Therefore, we can influence a given reward score by perturbing specific layers.
  • Figure 5: Overview of ParaUni. For understanding, the VLM takes image tokens and text tokens as input and performs understanding through an autoregressive process. For generation, the context information is compressed through a learnable query, and the learnable queries from all layers are then fed into a Layer Integration Module (LIM), which consists of a Transformer module and a Layer Norm layer. The integrated features are then sent to the cross-attention layers of the diffusion model for generation. In the RL phase, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM), which adds Gaussian noise perturbations to specific layers under reward and GradNorm guidance. Different layers are perturbed under different rewards, as detailed in Algorithm 1.
  • ...and 4 more figures
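
The captions for Figures 4 and 5 describe perturbing specific layer regions with Gaussian noise depending on the reward. The sketch below illustrates only that one idea and is not the paper's algorithm: the region boundaries in `REGIONS`, the reward keys, the noise scale `sigma`, and the function name `perturb_for_reward` are all assumptions, and the GradNorm guidance mentioned in the caption is not modeled.

```python
import numpy as np

# Assumed split of 28 layers into three regions; the true boundaries are not given here.
REGIONS = {
    "shallow": range(0, 9),
    "middle": range(9, 19),
    "deep": range(19, 28),
}

# Region described as most sensitive for each reward in the Figure 4 caption.
MOST_SENSITIVE = {"clip": "deep", "aesthetic": "middle", "pickscore": "middle"}

def perturb_for_reward(layer_feats, reward, sigma, rng):
    """Add Gaussian noise only to the layers in the region tied to `reward`.

    layer_feats: (num_layers, num_queries, dim) array of per-layer features.
    Returns the perturbed copy and the list of perturbed layer indices.
    """
    idx = list(REGIONS[MOST_SENSITIVE[reward]])
    out = layer_feats.copy()  # leave the caller's array untouched
    out[idx] += rng.normal(scale=sigma, size=out[idx].shape)
    return out, idx

rng = np.random.default_rng(0)
feats = rng.normal(size=(28, 4, 16))
noisy, idx = perturb_for_reward(feats, "clip", sigma=0.1, rng=rng)

# Report which layers actually changed.
changed = np.any(noisy != feats, axis=(1, 2))
print(np.flatnonzero(changed).tolist())
```

With the `"clip"` reward, only the layers in the assumed deep region are modified; the shallow and middle layers are returned unchanged, so exploration noise lands only where the reward is reported to be most responsive.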