ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction
Jiangtong Tan, Lin Liu, Jie Huang, Xiaopeng Zhang, Qi Tian, Feng Zhao
TL;DR
ParaUni tackles the challenge of integrating hierarchical representations from vision-language systems into diffusion-based generation within unified multimodal frameworks. It introduces a Layer Integration Module to fuse features across all VLM layers in parallel and a Layer-wise Dynamic Adjustment Mechanism to guide RL with layer-specific perturbations. Empirical results on GenEval and DPG-Bench show state-of-the-art generation quality and robust multi-reward RL improvements, with extensive ablations confirming the value of multi-layer conditioning and the LDAM design. The approach demonstrates improved fidelity and semantic alignment, with a scalable, modular architecture that can leverage multiple reward signals during RL and adapt to various prompts.
Abstract
Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to balance sufficient interaction with flexible implementation because of the large representation gap between the two components. Considering the abundant, hierarchical information in a VLM's layers, which ranges from low-level details to high-level semantics, we propose \textbf{ParaUni}. It extracts features from various VLM layers in a \textbf{Para}llel way for comprehensive information interaction, while retaining a flexible, separated architecture to enhance generation in \textbf{Uni}fied multimodal models. Concretely, visual features from all VLM layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Accordingly, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that aligns with the hierarchical properties of these layers to facilitate improvements on multiple rewards during RL. Extensive experiments show that ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multi-reward gains during RL stages. Code is available at https://github.com/JosephTiTan/ParaUni.
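To make the idea of parallel multi-layer conditioning concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the LIM combines layers, so this example assumes the simplest possible fusion: a softmax-weighted sum over per-layer features. The names `fuse_layers`, `layer_features`, and `layer_scores` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_features, layer_scores):
    """Fuse per-layer features into one conditioning tensor.

    layer_features: list of L layers, each a list of T tokens,
                    each token a list of D floats (same T and D per layer).
    layer_scores:   list of L floats; hypothetical per-layer logits that
                    would be learned in a real model (an assumption here).
    Returns a T x D nested list: the softmax-weighted sum across layers.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    # Every layer contributes in parallel; no layer depends on another's output.
    for weight, feats in zip(weights, layer_features):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * feats[t][d]
    return fused


# Two toy "layers", one token each, two feature dimensions.
shallow = [[1.0, 2.0]]   # stands in for low-level detail features
deep = [[3.0, 4.0]]      # stands in for high-level semantic features
condition = fuse_layers([shallow, deep], [0.0, 0.0])
```

With equal scores, each layer receives the same weight, so `condition` is the element-wise mean of the two layers. In a trained model the scores (or a richer fusion network) would decide how much each layer contributes to the condition passed to the diffusion model.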
