Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity

Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng, Yida Xue, Zeyu Yang, Qing Shen, Zhenfang Liu, Kang Zhao, Ningyu Zhang, Jungang Lou

TL;DR

This work introduces Data Reasoning Intensity (DRI), a metric that quantifies the latent logical reasoning value embedded in training data and frames LLM reasoning as a data-cognition tradeoff: $\eta(\mathcal{M}, \mathcal{D}) \propto \frac{E(\mathcal{D})}{C(\mathcal{M})}$. It proposes a two-phase Re-Cognizing Optimization that reshapes model cognition and emphasizes high-DRI data, pushing models toward their reasoning boundary without increasing data volume. The methodology decomposes reasoning traces into logical elements and derives two components, $S_{ctx}$ and $S_{opt}$, to compute a final DRI score via $S = \sigma\left(\gamma \frac{\log(S_{raw}+1)-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta\right)$. Empirical results across ReClor, LogicBench, LogiQA, and LogiQA2.0 with LLaMA-8B and Qwen-7B show consistent gains over data-centric baselines, as well as benefits under reinforcement learning with GRPO, indicating that prioritizing reasoning complexity in data is key to unlocking LLM cognitive potential.
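As a rough illustration of the final scoring step, the sketch below applies the normalization formula to a list of raw scores. Several points are assumptions rather than details confirmed here: the outer $\sigma(\cdot)$ is read as the logistic sigmoid, $\mu$ and $\sigma^2$ are taken to be the mean and variance of $\log(S_{raw}+1)$ over the scored samples, and $\gamma$, $\beta$, $\epsilon$ are given placeholder defaults. How $S_{raw}$ is built from $S_{ctx}$ and $S_{opt}$ is not specified in this summary, so the hypothetical `dri_scores` function simply takes $S_{raw}$ as input.

```python
import math
from statistics import fmean, pvariance


def dri_scores(raw_scores, gamma=1.0, beta=0.0, eps=1e-5):
    """Map raw scores S_raw to values in (0, 1).

    Assumption: mu and sigma^2 are the mean and population variance of
    log(S_raw + 1) over the given samples; the outer sigma is a sigmoid.
    gamma, beta, and eps are placeholder values, not the paper's settings.
    """
    logs = [math.log(s + 1.0) for s in raw_scores]
    mu = fmean(logs)
    var = pvariance(logs, mu)
    scores = []
    for x in logs:
        z = gamma * (x - mu) / math.sqrt(var + eps) + beta
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


if __name__ == "__main__":
    # Illustrative raw scores only; these are not values from the paper.
    for raw, s in zip([0, 1, 3, 7], dri_scores([0, 1, 3, 7])):
        print(f"S_raw={raw}  S={s:.3f}")
```

The log transform compresses large raw scores before standardization, and the sigmoid keeps the final score bounded, which makes uniform score bins meaningful when comparing samples.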

Abstract

Recent advances in large language models (LLMs) highlight the importance of training data structure and quality in shaping reasoning behavior. However, most existing approaches focus on transforming data formats while neglecting the internal reasoning complexity of training samples, leaving the reasoning potential of data under-explored and underutilized. In this work, we posit that LLM logical reasoning performance is jointly constrained by the potential of the training data and the cognitive capacity of the model. To make this relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples by decomposing and aggregating their logical structures. This allows us to analyze how well current LLMs utilize logical reasoning signals and identify performance gaps relative to data potential. Based on this insight, we introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data. Rather than increasing data volume, our method re-optimizes existing samples to better align with the LLM's logical reasoning boundary. Extensive experiments show that our approach significantly improves performance and generalization over data-centric strategies. We further validate our method under a reinforcement learning framework. Our results indicate that prioritizing reasoning complexity in data rather than sheer scale or superficial form is essential to realizing LLMs' full cognitive potential.

Paper Structure

This paper contains 40 sections, 12 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: An illustrative visualization of data potential and LLM logical reasoning performance. The horizontal axis denotes data potential and the vertical axis denotes an LLM's logical-reasoning performance. Each curve corresponds to the performance bound of a specific LLM (e.g., a particular model size) under a fixed architecture. As data potential increases, LLM performance usually improves, but it eventually reaches an upper limit determined jointly by the model's capacity and the limitations of the data.
  • Figure 2: The overall framework. Top: We first extract logical elements from each example and perform combinatorial reasoning to derive a Data Reasoning Intensity (DRI) score. Bottom: We then analyze the performance distribution across DRI levels on multiple datasets. Based on this, we propose the re-cognizing optimization strategy: the first stage reshapes the model’s recognition of reasoning patterns, and the second enhances its logical reasoning capability, thereby improving overall performance.
  • Figure 3: Effectiveness verification for DRI. Sample counts (bars, left axis) and model error rates (lines, right axis) are shown across DRI score bins. All three panels share the same layout: the x‐axis divides the score range into uniform intervals, the bar height indicates the number of examples per interval, and the overlaid line traces the error rate. (a) Training‐set distribution and error. (b) Original test‐set distribution. (c) Balanced test‐set distribution.
  • Figure 4: Experimental results of fine-tuning models on different score intervals. "Direct-All" denotes a model fine-tuned on all training examples. "Range(x, y)" denotes a model fine-tuned on examples whose DRI scores fall between $x$ and $y$. Left: Results on the test set. Right: Results on the balanced test set. The horizontal axis represents score intervals, and the vertical axis represents error rates (lower error indicates better performance).
  • Figure 5: Experimental results of fine-tuning models using different methods. "Ours-All": model fine-tuned on the full training set using re-cognizing optimization. "Ours-Range(x, y)": model fine-tuned with re-cognizing optimization on examples whose DRI scores fall between $x$ and $y$. Left: Results on the test set. Right: Results on the balanced test set. The horizontal axis represents score intervals, and the vertical axis represents error rates (lower error indicates better performance).
  • ...and 1 more figure
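The per-bin analysis described for Figure 3 and the "Range(x, y)" selection used in Figures 4 and 5 can be sketched as follows. This is a minimal sketch under stated assumptions: DRI scores are taken to lie in [0, 1], the number of bins is arbitrary, and the range is treated as half-open, since the captions do not say whether the bounds are inclusive. The function names `bin_error_rates` and `select_range` are hypothetical.

```python
def bin_error_rates(scores, correct, num_bins=5):
    """Return (sample_count, error_rate) for each uniform-width score bin.

    Assumes scores lie in [0, 1]. The error rate is None for empty bins.
    """
    counts = [0] * num_bins
    errors = [0] * num_bins
    for score, is_correct in zip(scores, correct):
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(score * num_bins), num_bins - 1)
        counts[idx] += 1
        if not is_correct:
            errors[idx] += 1
    return [(c, e / c if c else None) for c, e in zip(counts, errors)]


def select_range(examples, scores, low, high):
    """Keep examples with low <= score < high (half-open is an assumption)."""
    return [ex for ex, s in zip(examples, scores) if low <= s < high]


if __name__ == "__main__":
    # Made-up scores and outcomes, for illustration only.
    scores = [0.05, 0.15, 0.45, 0.55, 0.95, 1.0]
    correct = [True, True, False, True, False, False]
    print(bin_error_rates(scores, correct))
    print(select_range(["a", "b", "c"], [0.1, 0.5, 0.9], 0.4, 0.8))
```

The counts correspond to the bar heights and the error rates to the overlaid line in each panel; the filtered subset is what a "Range(x, y)" model would be fine-tuned on.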