MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu

TL;DR

MindGPT-4ov tackles the challenge of transferring general multimodal capabilities to vertical domains while preserving broad generalization and user experience. It introduces a comprehensive post-training framework consisting of information-density-based data synthesis, collaborative curriculum supervised fine-tuning, and a multi-stage hybrid reinforcement learning regimen, augmented by a 5D parallel training setup and deployment optimizations. Across broad multimodal benchmarks and vertical task scenarios, MindGPT-4ov demonstrates competitive general performance and notable gains in domain-specific tasks, while delivering more concise responses and a better user experience. The approach is designed to be scalable and reproducible: it transfers to other MLLMs without changing the base architecture, and a public release of code and data components is planned.

Abstract

We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing both the foundational capabilities of MLLMs and their generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multiple optimization objectives, such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization, to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.

Paper Structure

This paper contains 32 sections, 13 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Performance of MindGPT-4ov across multiple benchmarks.
  • Figure 2: An overview of our data pipeline. We begin by constructing a tree‑structured hierarchical label system. We then employ MLLMs to match images with their corresponding task labels. An MLLM is subsequently prompted to generate question–answer pairs based on the images, the task labels, and the information density score (IDS). Lastly, MLLMs are leveraged for matching evaluation and answer verification to ensure data quality.
  • Figure 3: The tree-structured label system. In the label tree, each branch denotes a first‑level label, while the leaves represent a selection of corresponding second‑level labels. The box diagrams above the tree illustrate representative third‑level topics encompassed by certain second‑level labels.
  • Figure 4: Overview of collaborative curriculum supervised fine-tuning (SFT). The proposed SFT training paradigm has the following advantages: on the data side, it automates dataset construction and produces balanced, diverse data; on the training side, it maintains a balanced development of knowledge and capabilities, remedies weak abilities, and safeguards the user experience. Meanwhile, the data and training sides collaborate to substantially improve the efficiency and stability of SFT training.
  • Figure 5: The distribution of RL training data.
  • ...and 7 more figures
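
The data pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the label names in `LABEL_TREE`, the `Sample` record, and the four callables (`match_label`, `score_density`, `generate_qa`, `verify`) are all hypothetical stand-ins for MLLM calls, and the way the information density score (IDS) is computed is not specified here, so it is treated as an opaque function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Label = Tuple[str, str]  # (first-level label, second-level label)

# Hypothetical two-level slice of a tree-structured label system.
# The real labels are not listed in this summary.
LABEL_TREE: Dict[str, List[str]] = {
    "math": ["geometry", "chart_reasoning"],
    "ocr": ["documents", "scene_text"],
}


@dataclass
class Sample:
    image_id: str
    label: Label
    ids: float  # information density score (assumed to be a float)
    question: str
    answer: str


def build_sample(
    image_id: str,
    match_label: Callable[[str, Dict[str, List[str]]], Label],
    score_density: Callable[[str], float],
    generate_qa: Callable[[str, Label, float], Tuple[str, str]],
    verify: Callable[[str, Label, str, str], bool],
) -> Optional[Sample]:
    """Run one image through the four stages; return None if it is rejected."""
    # Stage 1-2: match the image to a task label from the label tree.
    first, second = match_label(image_id, LABEL_TREE)
    # Assumption: labels outside the tree are discarded before generation.
    if second not in LABEL_TREE.get(first, []):
        return None
    # Stage 3: generate a QA pair conditioned on image, label, and IDS.
    ids = score_density(image_id)
    question, answer = generate_qa(image_id, (first, second), ids)
    # Stage 4: matching evaluation and answer verification.
    if not verify(image_id, (first, second), question, answer):
        return None
    return Sample(image_id, (first, second), ids, question, answer)


# Usage with trivial stubs in place of real MLLM calls.
sample = build_sample(
    "img_001",
    match_label=lambda img, tree: ("math", "geometry"),
    score_density=lambda img: 0.7,
    generate_qa=lambda img, label, ids: ("What is the area of the triangle?", "6"),
    verify=lambda img, label, q, a: True,
)
rejected = build_sample(
    "img_002",
    match_label=lambda img, tree: ("math", "poetry"),  # not in the tree
    score_density=lambda img: 0.2,
    generate_qa=lambda img, label, ids: ("", ""),
    verify=lambda img, label, q, a: True,
)
```

Passing each stage in as a callable keeps the sketch independent of any particular model or API; in practice each one would wrap a prompt to an MLLM.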