Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions

Siqi Kou, Qingyuan Tian, Hanwen Xu, Zihao Zeng, Zhijie Deng

TL;DR

The paper tackles the problem of identifying which training-data attributes most effectively stimulate math and code reasoning in large language models. It introduces Infra, a framework that uses influence functions to attribute reasoning performance to training data at instance, sequence, and token levels, employing a mean log-likelihood surrogate and Hessian approximations via EK-FAC. Key contributions include revealing cross-domain influences where high-difficulty math improves both domains and where low-difficulty code aids code reasoning, demonstrating a difficulty-flip data reweighting that doubles AIME24 accuracy and improves LiveCodeBench performance, and showing that sequence-level exploratory behavior enhances reasoning while token-level patterns differ between math and code. The work provides a data-centric, interpretable path toward targeted dataset curation for robust multi-domain reasoning in LLMs, with practical implications for constructing training data and understanding reasoning processes.
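As a rough illustration of the influence-function idea mentioned above, the following sketch scores training examples against a query on a toy linear-regression model. It is not the paper's implementation: the exact (damped) Hessian of a tiny model stands in for the EK-FAC approximation, the Gaussian log-likelihood of a single query point stands in for the mean log-likelihood surrogate, and all function names (`solve`, `fit`, `influence`) are hypothetical.

```python
# Toy influence-function sketch (assumption: linear regression, exact Hessian).
# Score = d f / d eps = -grad_f^T H^{-1} grad_L(z_train), where eps is the
# extra weight placed on the training example z_train.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * pivot for a, pivot in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def grad_loss(theta, x, y):
    """Gradient of the per-example loss 0.5 * (theta . x - y)^2."""
    residual = dot(theta, x) - y
    return [residual * xi for xi in x]

def fit(train, damping=0.0):
    """Minimize mean squared loss plus 0.5 * damping * |theta|^2.

    Returns the optimum and the Hessian of that objective.
    """
    n, d = len(train), len(train[0][0])
    H = [[sum(x[i] * x[j] for x, _ in train) / n + (damping if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * y for x, y in train) / n for i in range(d)]
    return solve(H, b), H

def influence(theta, H, train_example, query):
    """Effect of upweighting train_example on the query log-likelihood.

    The query measure is f = -0.5 * (theta . x_q - y_q)^2, so grad f is the
    negated loss gradient at the query. Positive scores mean the training
    example helps the query; negative scores mean it hurts.
    """
    x_q, y_q = query
    grad_f = [-g for g in grad_loss(theta, x_q, y_q)]
    x, y = train_example
    h_inv_g = solve(H, grad_loss(theta, x, y))
    return -dot(grad_f, h_inv_g)

if __name__ == "__main__":
    # Made-up data: two points on y = 2x and one outlier.
    train = [([1.0], 2.0), ([2.0], 4.0), ([1.0], 5.0)]
    theta, H = fit(train)
    query = ([1.0], 2.0)
    for z in train:
        print(z, influence(theta, H, z, query))
```

In this toy setting the two points consistent with the query receive positive scores and the outlier receives a negative one. For an LLM the Hessian cannot be formed or inverted exactly, which is why an approximation such as EK-FAC is needed.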

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on chains of thought (CoTs) generated by stronger models. However, existing strategies for curating such training data predominantly rely on heuristics, limiting generalizability and failing to capture the subtleties underlying the data. To address these limitations, we leverage influence functions to systematically attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics. Our Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning. Based on these findings, we introduce a simple yet effective dataset reweighting strategy that flips task difficulty, which doubles AIME24 accuracy from 10% to 20% and boosts LiveCodeBench accuracy from 33.8% to 35.3% for Qwen2.5-7B-Instruct. Moreover, our fine-grained attribution reveals that sequence-level exploratory behaviors enhance reasoning performance in both math and code, and that token-level influence patterns differ between math and code reasoning: the former favors natural-language logical connectors, while the latter emphasizes structural syntax.
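The abstract does not spell out how the difficulty flip is computed. One plausible reading, sketched here purely as an assumption, is to give each example a sampling weight so that the reweighted difficulty histogram within a domain is the mirror image of the original one (the mass of the easiest level moves to the hardest level and vice versa). The function name `flip_difficulty_weights` and the toy difficulty labels are hypothetical, not taken from the paper.

```python
from collections import Counter

def flip_difficulty_weights(levels):
    """Return one sampling weight per example (assumed scheme, see lead-in).

    levels: difficulty label of each example in one domain; labels must be
    sortable, with smaller meaning easier. After reweighting, the total mass
    at each level equals the original count of its mirror level.
    """
    counts = Counter(levels)
    ordered = sorted(counts)                        # easiest level first
    mirror = dict(zip(ordered, reversed(ordered)))  # easiest <-> hardest
    return [counts[mirror[level]] / counts[level] for level in levels]

if __name__ == "__main__":
    # Made-up labels: four easy, two medium, one hard example.
    toy_levels = [1, 1, 1, 1, 2, 2, 3]
    print(flip_difficulty_weights(toy_levels))
```

The weights sum to the dataset size, so they can be used directly as per-example sampling weights or loss weights. Only difficulty levels that actually occur in the data take part in the mirroring.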

Paper Structure

This paper contains 16 sections, 10 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: An illustration of our key findings on the question: Which attributes of training data effectively stimulate reasoning capabilities? Mixing challenging math problems with easier coding tasks leads to the highest influence scores for mathematical and coding reasoning (left). Guided by this insight, we curate an improved dataset and observe enhanced performance (right).
  • Figure 2: Cross-domain influence analysis of LLaMA3-8B-Base fine-tuned on combined MetaMathQA and OSS-Instruct for math and code performance. The most beneficial examples for math performance predominantly come from the math domain, while code-domain data also contributes non-trivially (left). A similar cross-domain benefit is observed for code performance (right).
  • Figure 3: Average influence score of the training dataset combining MetaMathQA and OSS-Instruct, evaluated on MBPP and GSM8K performance. Results are grouped by training data category (left) and MATH problem difficulty (right).
  • Figure 4: Different types of MATH questions from the MetaMathQA dataset (Yu et al., 2023).
  • Figure 5: Left: Average influence scores of math and code training data from varying difficulty levels on reasoning performance. For instance, Math$\rightarrow$Code denotes the influence of math data on code reasoning tasks. Right: Distribution of math and code samples across difficulty levels in the BS17k dataset. The original distribution is shown alongside the adjusted distribution obtained via the difficulty-flip strategy. See the table labeled tab:sft for a comparison of SFT results under different mixing strategies.
  • ...and 6 more figures