M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

Inclusion AI: Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan, Yifan Mao, Yuting Xiao, Ziping Ma

TL;DR

M2-Reasoning-7B addresses the gap in dynamic spatial reasoning for multimodal large language models by combining a high-quality, multi-stage data pipeline (294.2K samples) with a dynamic, multi-task RLVR training framework. The approach uses curriculum-based data ordering, step-wise optimization, and task-specific rewards to reconcile general and spatial reasoning capabilities. Results across eight benchmarks show state-of-the-art performance in both general and spatial reasoning. The work advances practical multimodal reasoning for real-world tasks by improving structured thought, temporal-spatial understanding, and instruction-following accuracy.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.

Paper Structure

This paper contains 26 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Benchmark performance of M2-Reasoning-7B.
  • Figure 2: Overview of the data configurations during cold-start and RLVR.
  • Figure 3: The M2-Reasoning model architecture, built upon the Qwen2.5-7B language model and incorporating a native-resolution vision encoder. The figure omits the MLP projector typically used to connect the vision encoder and the language model.
  • Figure 4: Visualization of the EDNM reward function for different values of the hyperparameter $\lambda$.