Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning

Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, Linda Ruth Petzold

TL;DR

This work examines how coding data used in Instruction Fine-Tuning (IFT) affects LLM reasoning. It evaluates three coding-data proportions (0%, 50%, 100%) across six backbones and twelve tasks in three reasoning domains (symbolic, logical, and arithmetic). Using a ShareGPT-derived data pipeline, the study finds that coding data generally improves overall reasoning at the IFT stage, with substantial gains for some models (e.g., 10.3 percentage points for Mistral) and varying effects across domains and tasks. Domain-level analysis shows the strongest benefits in symbolic reasoning and context-dependent trade-offs for arithmetic tasks, while task-level results show that no single coding proportion is optimal for every task. A case study illustrates that code-tuned models can produce correct, text-based reasoning paths rather than only code, which suggests that the reasoning gains extend beyond in-domain programming. These findings suggest that fine-tuning on coding data can activate broader reasoning capabilities in LLMs, and they can inform data-mixing strategies for instruction tuning.

Abstract

Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost LLM reasoning abilities during pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: How does coding data impact LLMs' reasoning capacities during the IFT stage? To explore this, we thoroughly examine the impact of coding data across different coding data proportions, model families, sizes, and reasoning domains, from various perspectives. Specifically, we create three IFT datasets with increasing coding data proportions, fine-tune six LLM backbones across different families and scales on these datasets, evaluate the tuned models' performance across twelve tasks in three reasoning domains, and analyze the outcomes from three broad-to-granular perspectives: overall, domain-level, and task-specific. Our holistic analysis provides valuable insights into each perspective. First, coding data tuning enhances the overall reasoning capabilities of LLMs across different model families and scales. Moreover, while the impact of coding data varies by domain, it shows consistent trends within each domain across different model families and scales. Additionally, coding data generally provides comparable task-specific benefits across model families, with optimal proportions in IFT datasets being task-dependent.

Paper Structure (22 sections, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Pipeline Overview. The process utilizes ShareGPT data as a starting point. 1. Data Category Classification: Using ChatGPT to classify code instances within ShareGPT to obtain a code-centric IFT dataset. 2. IFT Data Mixture: Constructing three IFT mixture datasets with increasing proportions of coding data. 3. Instruction Finetuning: Fine-tuning LLMs from six families across various scales with these three IFT datasets, respectively. 4. Evaluation: Evaluating the fine-tuned models' reasoning capacities across three domains. 5. Analysis: Analyzing from three broad-to-granular perspectives.
  • Figure 2: Overall comparison across different model scales. Results show the overall performance of models using Llama-1 and Llama-2 with 7B and 13B parameters as backbones.
  • Figure 3: Overall performance comparison across different coding data percentages.
  • Figure 4: Results comparison under each reasoning domain across different model families.
  • Figure 5: Results comparison for each dataset on (a) Llama-1-13B and (b) Llama-2-13B. The best results of each dataset are highlighted with a black frame.
  • ...and 2 more figures
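
Step 2 of the pipeline in Figure 1 (IFT data mixture) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_mixture`, the placeholder pools, the fixed mixture size of 100, and the choice to hold the total size constant across mixtures are all assumptions made for illustration. Only the three coding proportions (0%, 50%, 100%) come from the summary above.

```python
import random


def build_mixture(general, code, code_fraction, total, seed=0):
    """Sample `total` examples so that `code_fraction` of them come from `code`.

    Hypothetical helper: the summary does not specify how the mixtures are sampled.
    """
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be between 0 and 1")
    n_code = round(total * code_fraction)
    n_general = total - n_code
    if n_code > len(code) or n_general > len(general):
        raise ValueError("a pool is too small for the requested mixture")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Sample without replacement from each pool, then shuffle the combined set.
    mixture = rng.sample(code, n_code) + rng.sample(general, n_general)
    rng.shuffle(mixture)
    return mixture


# Placeholder pools standing in for the general and code-centric ShareGPT subsets.
general_pool = [{"source": "general", "id": i} for i in range(100)]
code_pool = [{"source": "code", "id": i} for i in range(100)]

# Three mixtures with increasing coding proportions, as in the summary.
mixtures = {
    p: build_mixture(general_pool, code_pool, p, total=100)
    for p in (0.0, 0.5, 1.0)
}
```

Keeping the total size fixed in this sketch means the mixtures differ only in the share of coding examples, so a difference in downstream accuracy would not be confounded with dataset size. Whether the paper controls size this way is not stated in this summary.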