Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions

Siqi Kou; Qingyuan Tian; Hanwen Xu; Zihao Zeng; Zhijie Deng

TL;DR

The paper asks which training-data attributes most effectively stimulate math and code reasoning in large language models. It introduces Infra, a framework that uses influence functions to attribute reasoning performance to training data at the instance, sequence, and token levels, using a mean log-likelihood surrogate for the reasoning objective and an EK-FAC approximation of the Hessian. Three findings stand out. First, influence crosses domains: high-difficulty math examples improve both math and code reasoning, while low-difficulty code examples are the most helpful for code reasoning. Second, a data reweighting strategy that flips task difficulty doubles AIME24 accuracy from 10% to 20% and raises LiveCodeBench accuracy from 33.8% to 35.3% for Qwen2.5-7B-Instruct. Third, sequence-level exploratory behavior enhances reasoning in both domains, while token-level patterns differ: math favors natural-language logic connectors and code favors structural syntax. Together, these results offer a data-centric, interpretable basis for curating training data for multi-domain reasoning.

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on chains of thought (CoTs) generated by stronger models. However, existing strategies for curating such training data rely predominantly on heuristics, which limits their generalizability and fails to capture subtleties in the data. To address these limitations, we leverage influence functions to systematically attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics. Our Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning. Based on these findings, we introduce a simple yet effective dataset reweighting strategy that flips task difficulty, which doubles AIME24 accuracy from 10% to 20% and boosts LiveCodeBench accuracy from 33.8% to 35.3% for Qwen2.5-7B-Instruct. Moreover, our fine-grained attribution reveals that sequence-level exploratory behaviors enhance reasoning performance in both math and code, and that token-level influence patterns are distinct for math and code reasoning: the former prefers natural-language logic connectors and the latter emphasizes structural syntax.
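To make the idea of influence-based attribution concrete, the following is a minimal sketch on a toy logistic-regression model, not the paper's implementation. It scores each training example by how much upweighting it would raise the mean log-likelihood of a set of query examples, using the classical influence-function form: the negative of the query gradient times the inverse Hessian times the training-loss gradient. Everything here is an assumption made for illustration: the model, the function names (`fit_logreg`, `influence_scores`), the L2 damping value, and the use of an exact Hessian solve in place of the EK-FAC approximation that the paper applies to LLMs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_logreg(X, y, l2=1e-2, steps=500, lr=0.5):
    """Fit logistic regression by gradient descent on mean loss + L2 penalty."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + l2 * w
        w -= lr * grad
    return w


def influence_scores(X_train, y_train, X_query, y_query, w, l2=1e-2):
    """Influence of each training example on the query mean log-likelihood.

    score_i = -grad_f(query)^T H^{-1} grad_L(train_i)
    A positive score means upweighting example i is predicted to raise
    the query log-likelihood. (Hypothetical helper for illustration.)
    """
    n, d = X_train.shape
    p_train = sigmoid(X_train @ w)

    # Hessian of the regularized mean training loss (positive definite for l2 > 0).
    H = (X_train * (p_train * (1.0 - p_train))[:, None]).T @ X_train / n
    H += l2 * np.eye(d)

    # Per-example gradients of the training loss (data term only), shape (n, d).
    grad_train = (p_train - y_train)[:, None] * X_train

    # Gradient of the surrogate f: mean log-likelihood over the query set.
    p_query = sigmoid(X_query @ w)
    grad_f = ((y_query - p_query)[:, None] * X_query).mean(axis=0)

    # Solve H v = grad_f once, then take one dot product per training example.
    v = np.linalg.solve(H, grad_f)
    return -(grad_train @ v)


# Toy usage: synthetic data, with the first training example reused as the query.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(float)
w = fit_logreg(X, y)
scores = influence_scores(X, y, X[:1], y[:1], w)
most_helpful = np.argsort(-scores)[:5]  # indices of the top-5 training examples
```

In this toy setting the query is itself a training example, so its own score is guaranteed to be positive: the score reduces to a quadratic form in a positive-definite inverse Hessian. For an LLM the exact solve is infeasible, which is why an approximation such as EK-FAC is needed.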
