MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs

Feilong Chen, Yijiang Liu, Yi Huang, Hao Wang, Miren Tian, Ya-Qi Yu, Minghui Liao, Jihao Wu

TL;DR

MindVL demonstrates data-efficient multimodal LLM training on Ascend NPUs via the MindSpeed-MLLM framework, challenging the assumption that such training requires a narrow set of hardware platforms. It documents the data recipe for the warm-up, multitask, and SFT stages, and introduces two performance techniques: averaging the weights of checkpoints trained with different sequence lengths, and test-time resolution search. MindVL-8B matches Qwen2.5VL-7B using 10% of its training data, and MindVL-671B-A37B matches Qwen2.5VL-72B using only 3% of the Qwen2.5VL training data. The work offers a practical hardware alternative, open data recipes, and techniques that improve reproducibility and performance for multimodal models on Ascend hardware.

Abstract

We propose MindVL, a multimodal large language model (MLLM) trained on Ascend NPUs. The training of state-of-the-art MLLMs is often confined to a limited set of hardware platforms and relies heavily on massive, undisclosed data recipes, which hinders reproducibility and open research. To change the common perception that Ascend hardware is unsuitable for efficient full-stage MLLM training, we introduce MindSpeed-MLLM, a highly efficient training framework that supports stable and high-performance training of large-scale Dense and Mixture-of-Experts (MoE) models on Ascend hardware. Based on this, we provide a systematic and open description of the data production methods and mixing strategies for all training stages. Furthermore, we present MindVL, a data-efficient multimodal large language model trained end-to-end on Ascend NPUs. In addition, we find that averaging weights from checkpoints trained with different sequence lengths is particularly effective and yields further gains when combined with test-time resolution search. Our experiments demonstrate superior data efficiency: MindVL-8B matches the performance of Qwen2.5VL-7B using only 10% of its training data, while our MoE model, MindVL-671B-A37B, matches Qwen2.5VL-72B using only 3% of the Qwen2.5VL training data and achieves performance comparable to other leading multimodal MoE models. Our work provides the community with a valuable hardware alternative, open data recipes, and effective performance-enhancing techniques.
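
The two techniques named in the abstract, checkpoint weight averaging and test-time resolution search, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes uniform averaging (the paper may weight checkpoints differently), represents each checkpoint as a plain dictionary of flat parameter lists, and treats resolution search as picking the candidate with the best validation score. The names `average_checkpoints`, `search_resolution`, and `evaluate` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat values.
StateDict = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[StateDict]) -> StateDict:
    """Uniformly average parameters across checkpoints (assumed scheme).

    Every checkpoint must expose the same parameter names and sizes, e.g.
    the same model trained with different sequence lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    reference = checkpoints[0]
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != reference.keys():
            raise ValueError("checkpoints must share parameter names")
    count = len(checkpoints)
    averaged: StateDict = {}
    for name, ref_values in reference.items():
        if any(len(ckpt[name]) != len(ref_values) for ckpt in checkpoints):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Average element-wise across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


def search_resolution(
    evaluate: Callable[[int], float], candidates: Sequence[int]
) -> int:
    """Return the candidate input resolution with the highest score.

    `evaluate` is a hypothetical callback that runs the model on a held-out
    set at the given resolution and returns an accuracy-like number.
    """
    if not candidates:
        raise ValueError("need at least one candidate resolution")
    return max(candidates, key=evaluate)
```

In this sketch the averaged weights would be loaded into the model first, and the resolution search would then be run on the merged model, which mirrors the abstract's statement that the two techniques are combined.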

Paper Structure

This paper contains 45 sections, 15 figures, 15 tables.

Figures (15)

  • Figure 2: The Overall Architecture of MindSpeed-MLLM and Its Relationship with Other MindSpeed Frameworks.
  • Figure 3: Data curation process of MindVL training data.
  • Figure 4: Box plots of accuracies with varying input image resolutions for different models. The maximum, minimum, median, and quartiles are plotted.
  • Figure 5: Comparison of loss values among various training tools and platforms on the COCO dataset.
  • Figure 6: Loss decline trend on the in-house slow-thinking dataset.
  • ...and 10 more figures