M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

Inclusion AI; Fudong Wang; Jiajia Liu; Jingdong Chen; Jun Zhou; Kaixiang Ji; Lixiang Ru; Qingpei Guo; Ruobing Zheng; Tianqi Li; Yi Yuan; Yifan Mao; Yuting Xiao; Ziping Ma

TL;DR

M2-Reasoning-7B addresses the gap in dynamic spatial reasoning in multimodal large language models (MLLMs). It combines a high-quality, multi-stage data pipeline that produces 294.2K samples with a dynamic, multi-task RLVR training framework. The approach uses curriculum-based data ordering, step-wise optimization, and task-specific rewards to harmonize general and spatial reasoning. Across eight benchmarks, the model achieves state-of-the-art results in both general and spatial reasoning, showing that a single model can handle both. The work advances practical multimodal reasoning for real-world tasks by improving structured reasoning, temporal-spatial understanding, and instruction-following accuracy.
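
The TL;DR names curriculum-based data ordering but does not say how difficulty is measured. The following is a minimal sketch of the general idea, assuming a hypothetical per-sample `pass_rate` (the share of reference rollouts judged correct) as the difficulty proxy; the `Sample` record, its fields, and the task labels are illustrative, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical training record; the paper's actual data schema is not described."""
    prompt: str
    task: str         # illustrative task label, e.g. "general" or "spatial"
    pass_rate: float  # assumed difficulty proxy: share of reference rollouts judged correct


def curriculum_order(samples: List[Sample]) -> List[Sample]:
    """Return samples ordered easy-to-hard (highest pass rate first).

    Python's sort is stable even with reverse=True, so samples with equal
    pass rates keep their original relative order.
    """
    return sorted(samples, key=lambda s: s.pass_rate, reverse=True)


if __name__ == "__main__":
    data = [
        Sample("q1", "general", 0.2),
        Sample("q2", "spatial", 0.9),
        Sample("q3", "general", 0.5),
    ]
    for s in curriculum_order(data):
        print(s.prompt, s.task, s.pass_rate)
```

Sorting on a single scalar keeps the sketch simple; a real multi-task schedule might also interleave tasks within each difficulty band, which this example does not attempt.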

Abstract

Recent advances in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy that uses step-wise optimization to mitigate data conflicts and task-specific rewards to deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state of the art (SOTA) across 8 benchmarks, with superior performance in both general and spatial reasoning.
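
The abstract does not define its task-specific rewards. As a hedged illustration of what task-specific verifiable rewards can look like in RLVR generally, the sketch below routes each response to a reward function by task type: exact match for discrete answers, and a graded relative-error reward for numeric spatial estimates such as distances or counts. The `<answer>` tag format, the task labels, and the relative-error formula are all assumptions, not the paper's definitions.

```python
import re
from typing import Callable, Dict, Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer> (assumed output format), or None."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return m.group(1).strip() if m else None


def exact_match_reward(pred: str, gold: str) -> float:
    """Binary reward for discrete answers (e.g. a multiple-choice letter)."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def relative_error_reward(pred: str, gold: str) -> float:
    """Graded reward for numeric answers: 1 - relative error, clipped at 0."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0  # non-numeric prediction cannot be verified
    if g == 0:
        return 1.0 if p == 0 else 0.0
    return max(0.0, 1.0 - abs(p - g) / abs(g))


# Hypothetical task labels mapped to their verifiers.
REWARD_FNS: Dict[str, Callable[[str, str], float]] = {
    "general": exact_match_reward,
    "spatial_numeric": relative_error_reward,
}


def task_reward(task: str, response: str, gold: str) -> float:
    """Dispatch to the task's reward; unparseable outputs earn zero."""
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    return REWARD_FNS[task](pred, gold)


if __name__ == "__main__":
    print(task_reward("general", "<answer>B</answer>", "b"))
    print(task_reward("spatial_numeric", "<answer>4.0</answer>", "5"))
```

The graded numeric reward gives partial credit for near-miss estimates, whereas a binary exact-match check would score every imprecise distance estimate as zero.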
