Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning

Shenshen Li, Xing Xu, Kaiyuan Deng, Lei Wang, Heng Tao Shen, Fumin Shen

TL;DR

This work introduces Reasoning Activation Potential (RAP), a data-selection framework for multi-modal large language models (MLLMs) that identifies the high-value cognitive samples driving genuine cross-modal reasoning. RAP relies on two estimators: the Causal Discrepancy Estimator (CDE), which uses the potential outcome model to quantify output-level reliance on visual input, and the Attention Confidence Estimator (ACE), which uses token-level attention maps to filter out samples with attention-biased reasoning. A Difficulty-aware Replacement Module (DRM) then substitutes easy samples with harder, cognitively informative ones. Empirically, RAP achieves state-of-the-art performance using only 9.3% of the full training data while reducing computational costs by over 43%, across multiple datasets and base models, supporting the “truth in the few” principle for multi-modal reasoning. The approach improves how much models use cross-modal reasoning and generalizes across model architectures and RL settings. Possible extensions include SFT and dynamic data selection, which could further improve efficiency and raise the upper bound of reasoning in MLLMs.

Abstract

While multi-modal large language models (MLLMs) have made significant progress in complex reasoning tasks via reinforcement learning, it is commonly believed that extensive training data is necessary for improving multi-modal reasoning ability, inevitably leading to data redundancy and substantial computational costs. However, can smaller high-value datasets match or outperform full corpora for multi-modal reasoning in MLLMs? In this work, we challenge this assumption through a key observation: meaningful multi-modal reasoning is triggered by only a sparse subset of training samples, termed cognitive samples, whereas the majority contribute marginally. Building on this insight, we propose a novel data selection paradigm termed Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning using two complementary estimators: 1) the Causal Discrepancy Estimator (CDE), which, based on the potential outcome model principle, eliminates samples that overly rely on language priors by comparing outputs between multi-modal and text-only inputs; 2) the Attention Confidence Estimator (ACE), which exploits token-level self-attention to discard samples dominated by irrelevant but over-emphasized tokens in intermediate reasoning stages. Moreover, we introduce a Difficulty-aware Replacement Module (DRM) to substitute trivial instances with cognitively challenging ones, thereby ensuring complexity for robust multi-modal reasoning. Experiments on six datasets show that our RAP method consistently achieves superior performance using only 9.3% of the training data, while reducing computational costs by over 43%.
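
The three-stage selection described in the abstract can be sketched as a simple filter-then-replace pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formulas, so the per-sample scores (`discrepancy`, `attn_confidence`, `difficulty`) are assumed to be precomputed, and the thresholds `lambda_c`, `lambda_a`, and `easy_cutoff`, as well as the `hard_pool` of replacement candidates, are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    discrepancy: float      # assumed: gap between multi-modal and text-only outputs
    attn_confidence: float  # assumed: confidence that attention falls on relevant tokens
    difficulty: float       # assumed: 0.0 = trivial, 1.0 = hard


def select_cognitive_samples(samples, hard_pool, lambda_c, lambda_a, easy_cutoff):
    """Filter samples in two stages, then swap trivial survivors for hard ones."""
    # Stage 1 (CDE-like): drop samples whose output barely changes without
    # the image, i.e. samples that lean on language priors.
    after_cde = [s for s in samples if s.discrepancy >= lambda_c]

    # Stage 2 (ACE-like): drop samples with low attention confidence.
    after_ace = [s for s in after_cde if s.attn_confidence >= lambda_a]

    # Stage 3 (DRM-like): replace trivial survivors with the hardest unused
    # candidates from a caller-supplied pool (hypothetical replacement policy).
    kept_ids = {s.sample_id for s in after_ace}
    candidates = sorted(
        (h for h in hard_pool
         if h.sample_id not in kept_ids and h.difficulty >= easy_cutoff),
        key=lambda h: h.difficulty,
        reverse=True,
    )
    selected, next_candidate = [], 0
    for s in after_ace:
        if s.difficulty < easy_cutoff and next_candidate < len(candidates):
            selected.append(candidates[next_candidate])
            next_candidate += 1
        else:
            selected.append(s)
    return selected


if __name__ == "__main__":
    pool = [
        Sample("a", 0.9, 0.8, 0.1),  # passes both filters, but trivial
        Sample("b", 0.1, 0.9, 0.9),  # language-prior biased
        Sample("c", 0.8, 0.2, 0.9),  # attention biased
        Sample("d", 0.7, 0.7, 0.6),  # kept as-is
    ]
    hard = [Sample("h1", 0.9, 0.9, 0.95), Sample("h2", 0.9, 0.9, 0.7)]
    chosen = select_cognitive_samples(pool, hard, 0.5, 0.5, 0.3)
    print([s.sample_id for s in chosen])
```

The sketch keeps the subset size fixed during replacement, which is one plausible reading of "substitute"; how the paper actually sources and ranks replacement samples is not stated in this summary.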

Paper Structure

This paper contains 14 sections, 8 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of (a) accuracy under varying training dataset sizes and (b) performance-efficiency trade-offs of various methods.
  • Figure 2: Illustrative examples for two ineffective training sample types: (a) language-prior biased samples and (b) attention-biased samples.
  • Figure 3: The overall pipeline of our RAP method. First, the Causal Discrepancy Estimator (CDE) filters out samples that overly rely on language priors via output-level discrepancy. Then, the Attention Confidence Estimator (ACE) excludes attention-biased samples by token-level attention distributions. Finally, the Difficulty-aware Replacement Module (DRM) selectively replaces trivial instances with cognitively challenging ones, yielding a refined subset of cognitive samples.
  • Figure 4: Cross-model generalization of cognitive samples selected by RAP. Performance with InternVL3-2B trained on samples from Qwen2.5-VL-3B (left), and vice versa (right).
  • Figure 5: (a) Visualization of output discrepancies between multi-modal and text-only inputs on the full MM-Eureka training dataset. (b) Performance variation with respect to the hyperparameters $\lambda_{a}$ and $\lambda_{c}$ on MMStar. (c) Comparative analysis of multi-modal reasoning utilization on four datasets.
  • ...and 2 more figures