InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian; Danni Yang; Guanzhou Chen; Erfei Cui; Zhaokai Wang; Yuchen Duan; Penghao Yin; Sitao Chen; Ganlin Yang; Mingxin Liu; Zirun Zhu; Ziqian Fan; Leyao Gu; Haomin Wang; Qi Wei; Jinhui Yin; Xue Yang; Zhihang Zhong; Qi Qin; Yi Xin; Bin Fu; Yihao Liu; Jiaye Ge; Qipeng Guo; Gen Luo; Hongsheng Li; Yu Qiao; Kai Chen; Hongjie Zhang

TL;DR

This report presents InternVL-U, a lightweight 4B-parameter unified multimodal model (UMM) that democratizes understanding, reasoning, generation, and editing capabilities within a single framework. It also constructs a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning.

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models more than 3x its size, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

Paper Structure (55 sections, 14 equations, 29 figures, 24 tables)

Introduction
Related Work
Multimodal Large Language Models
Visual Generative Models
Unified Multimodal Models
Method: InternVL-U
Model Architecture
Overall Design Principles
Visual Generation Head
Training Strategy
Training Objective
Training Pipeline
Data Construction
Open-source Data Collection
General Data Preprocessing and Synthesis
...and 40 more sections

Figures (29)

Figure 1: Showcases of InternVL-U for general text-to-image generation (top) and image editing (bottom). InternVL-U supports high-fidelity image generation and editing at any resolution.
Figure 2: Showcases of InternVL-U for spatial-centric, perception, science-centric, humor-centric, and reasoning-centric text-to-image generation or editing tasks. InternVL-U demonstrates such core multimodal capabilities across various visual domains.
Figure 3: The architectural design of InternVL-U. The framework highlights three design principles: (1) unified contextual modeling supporting modality-adaptive generation targets, (2) structural efficiency via a unified backbone with modality-specific modular design, and (3) decoupled visual representations for understanding and generation tasks. Und. and Gen. denote Understanding and Generation, respectively.
Figure 4: Architecture of the Visual Generation Head. (a) Overview of the head with dual-stream MMDiT blocks. (b) Detailed structure of the Dual-Stream Attention Block and Dual-Stream FFN Block. (c) Illustration of the Unified MSRoPE (Multi-Scale Rotary Positional Embeddings) applied to VAE image latents and multimodal context embeddings.
Figure 5: Examples of general data synthesized by our pipeline. The synthesized data features varied textual annotations and covers diverse visual domains, including portraits, posters, natural scenes, etc.
...and 24 more figures
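
The dual-stream attention named in the Figure 4 caption can be illustrated with a minimal sketch. It follows the general MMDiT pattern, in which each stream (image latents and context embeddings) keeps its own projection weights while attention runs jointly over the concatenated token sequence. Everything here is an assumption made for illustration and is not taken from the report: the function names, the single attention head, the random weights, and the omission of normalization, timestep modulation, and MSRoPE. It should not be read as InternVL-U's actual block.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix; stands in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_stream_params(dim, rng):
    """Hypothetical per-stream weights: query, key, value, and output projections."""
    return {name: rand_matrix(dim, dim, rng) for name in ("q", "k", "v", "o")}


def dual_stream_attention(img_tokens, ctx_tokens, img_params, ctx_params):
    """Single-head joint attention with separate weights per stream."""
    # Each stream is projected with its own weights, then the two
    # sequences are concatenated (image tokens first, context second).
    q = matmul(img_tokens, img_params["q"]) + matmul(ctx_tokens, ctx_params["q"])
    k = matmul(img_tokens, img_params["k"]) + matmul(ctx_tokens, ctx_params["k"])
    v = matmul(img_tokens, img_params["v"]) + matmul(ctx_tokens, ctx_params["v"])

    # Scaled dot-product attention over the joint sequence, so every
    # token can attend to tokens from both streams.
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    mixed = matmul(weights, v)

    # Split back into streams and apply each stream's own output projection.
    n_img = len(img_tokens)
    img_out = matmul(mixed[:n_img], img_params["o"])
    ctx_out = matmul(mixed[n_img:], ctx_params["o"])
    return img_out, ctx_out, weights


if __name__ == "__main__":
    rng = random.Random(0)
    img = rand_matrix(4, 8, rng)  # 4 image-latent tokens, width 8
    ctx = rand_matrix(3, 8, rng)  # 3 context tokens, width 8
    img_out, ctx_out, attn = dual_stream_attention(
        img, ctx, make_stream_params(8, rng), make_stream_params(8, rng)
    )
    print(len(img_out), len(ctx_out), len(attn[0]))
```

The point of the sketch is the split of responsibilities: the weights are modality-specific, but the attention matrix spans both streams, which is what lets image latents condition on the multimodal context.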
