M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou

TL;DR

M2-omni tackles the challenge of building a fully capable omni-MLLM by introducing a unified multimodal framework with modality-specific encoders and a shared LLM. A three-stage training pipeline (pre-training, instruction tuning, alignment tuning) progressively aligns the visual, audio, video, and textual modalities. The model uses a step balance strategy during pre-training and a dynamic adaptive balance strategy during instruction tuning to mitigate disparities in data volume and convergence rate across modalities, and it preserves language proficiency by including a 25% pure-text data component. The largest model, M2-omni-72B, achieves an OpenCompass average score of about 75.1, often outperforming open-source counterparts and approaching GPT-4o on vision-language tasks, with strong performance on audio and free-form dialogue generation as well. The work provides extensive open training data configurations and procedures, aiming to accelerate omni-MLLM research and narrow the gap to proprietary models.

Abstract

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input and generate multimodal sequences that interleave audio, image, or text outputs, thereby enabling an advanced, interactive real-time experience. Training such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To the best of our knowledge, M2-omni is currently a highly competitive open-source alternative to GPT-4o, characterized by its comprehensive modality and task support as well as its exceptional performance. We expect M2-omni to advance the development of omni-MLLMs, thus facilitating future research in this domain.

Paper Structure

This paper contains 28 sections, 11 equations, 13 figures, 23 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overall illustration of M2-omni. (Top) M2-omni employs multi-stage training with progressive modality alignment and a balanced multimodal multi-task training strategy to achieve optimal performance on each modality. (Left-bottom) M2-omni supports as many modalities and tasks as other omni-MLLMs combined. (Right-bottom) M2-omni achieves competitive performance on a broad range of multimodal tasks among its omni-MLLM counterparts. Note that the values on LibriSpeech and AISHELL-1 are shown as reciprocals for better visualization, and the LibriSpeech and AISHELL-1 results for GPT-4o (GPT-4o-Realtime) are taken from yao2024minicpm. More comprehensive results can be found in the experiments section.
  • Figure 2: Overall architecture of M2-omni. M2-omni can process arbitrary combinations of text, image, video, and audio modalities as input and generate multimodal sequences that interleave text, image, or audio outputs.
  • Figure 3: Illustration of the templates of image, video, and audio.
  • Figure 4: Illustration of the training pipeline of M2-omni. Both the pre-training and the instruction tuning contain three stages, designed to progressively absorb knowledge from more modalities and ensure the model's optimal performance on all modalities and tasks. $L_{un}$ and $L_{gen\_a}$ denote understanding and audio generation loss, respectively.
  • Figure 5: Illustration of the dynamic adaptive balance strategy used during the Omni-Modality Instruction Tuning stage. $\widetilde{w}_{i, t}$ refers to the loss weight allocated to the $i$-th modality at the $t$-th validation segment.
  • ...and 8 more figures