Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai

TL;DR

The paper tackles unstable optimization and catastrophic forgetting in monolithic multimodal LLMs by embedding a new visual parameter space, in the form of multimodal experts, inside a pre-trained LLM. It introduces Mono-InternVL and the more data-efficient Mono-InternVL-1.5, built on Endogenous Visual Pre-training (EViP) and EViP++ respectively, plus a fused CUDA kernel that speeds up MoE inference. Through staged delta tuning and a focus on high-quality data, the approach performs strongly on 15 multimodal benchmarks: Mono-InternVL outperforms existing monolithic MLLMs on 12 of them, and Mono-InternVL-1.5 reaches performance similar to the modular InternVL-1.5 while reducing first-token latency by up to 69%. The work shows that monolithic MLLMs can match modular counterparts in performance at lower data and computation cost, offering a scalable path for future multimodal systems.

Abstract

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs, but at a relatively high data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and reorganizes the pre-training process to make it more efficient. For inference, Mono-InternVL-1.5 includes a fused CUDA kernel that speeds up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining performance competitive with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., a +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
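
The core idea in the abstract, modality-specific experts embedded in a pre-trained LLM and trained by delta tuning, can be illustrated with a toy sketch. This is not the authors' implementation: the `Expert` class, the `multimodal_moe` function, and the affine "experts" are hypothetical stand-ins chosen for brevity, and the routing shown here is an assumed hard assignment by token modality. The frozen text expert stands in for the pre-trained LLM weights, and only the visual expert accepts updates.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Toy stand-in for an expert FFN: an elementwise affine map (hypothetical)."""
    scale: float
    bias: float
    trainable: bool

    def __call__(self, x: Sequence[float]) -> Vector:
        return [self.scale * v + self.bias for v in x]

    def apply_update(self, d_scale: float, d_bias: float) -> None:
        """Apply a parameter update only if this expert is trainable."""
        if self.trainable:
            self.scale += d_scale
            self.bias += d_bias


def multimodal_moe(
    tokens: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert that matches its modality (assumed hard routing)."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        visual_expert(tok) if vis else text_expert(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


# Frozen text expert (pre-trained LLM weights) and a trainable visual expert.
text_expert = Expert(scale=1.0, bias=0.0, trainable=False)
visual_expert = Expert(scale=2.0, bias=1.0, trainable=True)

# The same update is offered to both experts; only the visual one changes.
for expert in (text_expert, visual_expert):
    expert.apply_update(d_scale=0.5, d_bias=0.5)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_visual = [True, False, True]
print(multimodal_moe(tokens, is_visual, text_expert, visual_expert))
```

Because the text expert never changes, text-only inputs produce the same outputs before and after visual training, which is the intuition behind avoiding catastrophic forgetting in this setup.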

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Comparison of Mono-InternVL, Mono-InternVL-1.5 and existing MLLMs. Compared with modular MLLMs, Mono-InternVL and Mono-InternVL-1.5 embed visual experts into the pre-trained LLM and integrate visual encoding and language decoding into a single LLM. Through endogenous visual pre-training (EViP), Mono-InternVL significantly pushes the performance boundaries of monolithic MLLMs. With EViP++, Mono-InternVL-1.5 not only significantly reduces data costs, but also maintains competitive performance on downstream tasks.
  • Figure 2: Monolithic architecture of Mono-InternVL and Mono-InternVL-1.5. Mono-InternVL is designed as a multimodal MoE structure, where visual and textual tokens are processed by their corresponding experts. Mono-InternVL-1.5 further integrates attention experts and the MoE CUDA kernel to facilitate visual pre-training while retaining model efficiency.
  • Figure 3: The training recipe of Mono-InternVL (top) and Mono-InternVL-1.5 (bottom). In the first stage, Mono-InternVL is progressively pre-trained on massive data via three sub-stages (S1.1, S1.2, S1.3), where most parameters of the LLM are frozen to preserve the pre-trained knowledge. In the second stage (S2), the entire model is optimized to accommodate various instructions. Compared to Mono-InternVL, Mono-InternVL-1.5 integrates visual attention experts and reduces training data by up to 58%.
  • Figure 4: Illustration of Mono-InternVL-1.5 fused kernel workflow. The left thread blocks handle textual tokens while those on the right handle visual tokens. Although two thread blocks are assigned per data block, nearly half exit immediately upon entry, making the kernel effectively behave as a single-branch implementation.
  • Figure 5: Ablation studies of EViP and EViP++ as the pre-training data size increases across three sub-stages: (S1.1) Concept learning; (S1.2) Semantic learning; (S1.3) Alignment learning. For each data point, we fine-tune the corresponding pre-trained model on the LLaVA-665k instruction data and report the downstream performance.
  • ...and 1 more figure
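
The dispatch pattern in the Figure 4 caption (two thread blocks per data block, with many exiting immediately) can be mimicked by a small simulation. This is plain Python, not CUDA, and not the authors' kernel: `simulate_fused_launch`, the `TEXT`/`VISUAL` constants, and the exit rule (a thread block returns at once when its data block holds no tokens of its modality) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

TEXT, VISUAL = 0, 1  # hypothetical branch ids: "left" = text, "right" = visual


def simulate_fused_launch(
    modalities: Sequence[int], block_size: int
) -> Tuple[int, int, List[Tuple[int, int, int]]]:
    """Simulate a (num_data_blocks x 2) grid of thread blocks.

    Returns (launched, early_exits, work), where work lists
    (data_block, branch, n_tokens) for thread blocks that did real work.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    launched = 0
    early_exits = 0
    work: List[Tuple[int, int, int]] = []
    for start in range(0, len(modalities), block_size):
        tile = modalities[start:start + block_size]
        data_block = start // block_size
        for branch in (TEXT, VISUAL):
            launched += 1
            n_tokens = sum(1 for m in tile if m == branch)
            if n_tokens == 0:
                # Assumed rule: nothing to do for this modality, exit on entry.
                early_exits += 1
                continue
            work.append((data_block, branch, n_tokens))
    return launched, early_exits, work


# A prompt with a contiguous run of image tokens between two text spans.
sequence = [TEXT] * 5 + [VISUAL] * 24 + [TEXT] * 3
launched, early_exits, work = simulate_fused_launch(sequence, block_size=4)
print(f"launched={launched}, early_exits={early_exits}, active={len(work)}")
```

In this toy setting, every data block that holds a single modality causes exactly one early exit, and only blocks at a text/image boundary keep both branches busy. The longer the contiguous image and text runs, the closer the share of early exits gets to one half.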