InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang

TL;DR

This report presents InternVL-U, a lightweight 4B-parameter unified multimodal model (UMM) that democratizes understanding, reasoning, generation, and editing within a single framework. The authors also construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks such as text rendering and scientific reasoning.

Abstract

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models more than 3x its size, such as BAGEL (14B), on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

Paper Structure

This paper contains 55 sections, 14 equations, 29 figures, and 24 tables.

Figures (29)

  • Figure 1: Showcases of InternVL-U for general text-to-image generation (top) and image editing (bottom). InternVL-U supports high-fidelity image generation and editing at any resolution.
  • Figure 2: Showcases of InternVL-U for spatial-centric, perception-centric, science-centric, humor-centric, and reasoning-centric text-to-image generation or editing tasks. InternVL-U demonstrates these core multimodal capabilities across various visual domains.
  • Figure 3: The architectural design of InternVL-U. The framework highlights three design principles: (1) unified contextual modeling supporting modality-adaptive generation targets, (2) structural efficiency via a unified backbone with modality-specific modular design, and (3) decoupled visual representations for understanding and generation tasks. Und. and Gen. denote Understanding and Generation, respectively.
  • Figure 4: Architecture of the Visual Generation Head. (a) Overview of the head with dual-stream MMDiT blocks. (b) Detailed structure of the Dual-Stream Attention Block and Dual-Stream FFN Block. (c) Illustration of the Unified MSRoPE (Multi-Scale Rotary Positional Embeddings) applied to VAE image latents and multimodal context embeddings.
  • Figure 5: Examples of general data synthesized by our pipeline. The synthesized data features varied textual annotations and covers diverse visual domains, including portraits, posters, natural scenes, etc.
  • ...and 24 more figures