Learning from Synthetic Data Improves Multi-hop Reasoning

Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger

TL;DR

This work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities and discovers that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge.

Abstract

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge -- a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.
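
To make "rule-generated" and "verifiable" concrete, here is a minimal sketch of the general idea, not the authors' generator or any of the datasets named in this paper. It builds a small fictional family chain, composes the "parent" relation a fixed number of times by rule to obtain a ground-truth answer, and scores a prediction by exact match. All names, the question template, and the functions `make_world`, `make_question`, and `reward` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical fictional names; none are taken from the paper's datasets.
NAMES = ["Zorvan", "Mirelle", "Quenby", "Thassia", "Ulric", "Brenna"]

def make_world(rng):
    """Shuffle the names into a chain and map each child to its parent."""
    people = NAMES[:]
    rng.shuffle(people)
    # parent_of[child] = parent; the last person in the chain has no parent.
    return {people[i]: people[i + 1] for i in range(len(people) - 1)}

def make_question(parent_of, start, hops):
    """Compose the 'parent' relation `hops` times to derive the answer by rule."""
    answer = start
    for _ in range(hops):
        answer = parent_of[answer]
    phrase = "the parent of " * hops
    return f"Who is {phrase}{start}?", answer

def reward(prediction, answer):
    """Rule-based outcome check: 1.0 on a normalized exact match, else 0.0."""
    return float(prediction.strip().lower() == answer.strip().lower())

rng = random.Random(0)
world = make_world(rng)
# The head of the chain is the only person who is nobody's parent.
start = next(p for p in NAMES if p not in world.values())
question, answer = make_question(world, start, hops=3)
```

Because the answer is derived from the same rules that build the world, no human annotation or LLM-based verifier is needed to score a model's output, and question complexity can be varied through `hops`.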

Paper Structure (20 sections, 1 equation, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Synthetic-to-real transfer by learning knowledge composition. We RL fine-tune LLMs on rule-generated synthetic datasets (left). Despite containing no real-world knowledge, training on these datasets teaches models to compose knowledge (center). This fundamental skill of chaining information across multiple steps transfers to real-world multi-hop reasoning benchmarks (right).
  • Figure 2: F1 scores on real-world multi-hop reasoning datasets of LLMs RL fine-tuned on synthetic datasets. We observe consistent transfer from synthetic data training to real-world multi-hop reasoning benchmarks: $\{$RG-Family, RG-Knights, GSM-$\infty$, PhantomWiki$\} \rightarrow \{$HotpotQA, 2WikiMultihopQA, MuSiQue, CofCA, SynthWorlds-RM$\}$. The performance transfer trends hold across model families and sizes (Qwen and Phi LLMs in the 0.6-4B parameter range). On each synthetic dataset, we fine-tune each LLM with 2 random training seeds and evaluate the final checkpoints of both runs, from which we calculate the standard error shown as error bars. See \ref{fig:synthetic_data_performance_transfer_large_models} for similar plots for the larger LLMs Qwen3-4B and Qwen2.5-7B-Instruct.
  • Figure 3: Performance of intermediate training checkpoints on unseen test sets, stratified by question complexity. We evaluate Qwen3-0.6B and Qwen3-1.7B intermediate checkpoints trained on PhantomWiki (left) and GSM-$\infty$ (right) on held-out test sets. Test sets share no factual overlap with training data. Question complexity is defined as the number of document hops (PhantomWiki) or arithmetic operations (GSM-$\infty$) required to reach the answer. As fine-tuning progresses (lines growing darker), performance improves across all complexity levels, including out-of-domain (OOD) difficulties beyond those seen during training. See \ref{fig:reasoning_evolution_Qwen2.5-1.5B-Instruct__Phi-4-mini-reasoning} for similar plots for other LLMs.
  • Figure 4: Groundedness of Qwen3-0.6B reasoning traces in MuSiQue and CofCA intermediate answers. For each dataset, we plot the fraction of model generations (reasoning traces) that contain the ground-truth nth intermediate answer. As RL fine-tuning progresses on PhantomWiki (left) and GSM-$\infty$ (right), reasoning traces include a higher proportion of correct intermediate answers. This indicates that RL fine-tuning with outcome-only reward on synthetic data also improves the reasoning process. See \ref{fig:reasoning_evolution_msq_others} for similar analysis on other models.
  • Figure 5: Real-world multi-hop reasoning performance of Qwen3-0.6B intermediate training checkpoints. We evaluate checkpoints every 500 training steps, and report mean $\pm$ standard error with the solid line and shaded region. Performance on all benchmarks improves steadily with training steps (equivalently, with the number of synthetic training samples), providing evidence for synthetic data scaling in RL fine-tuning.
  • ...and 5 more figures
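
Two quantities in the captions above can be sketched in a few lines: the fraction of reasoning traces that contain a ground-truth intermediate answer (Figure 4) and the mean with standard error across training seeds (Figures 2 and 5). This is a minimal illustration under stated assumptions, not the authors' evaluation code: the case-insensitive substring match, the use of the sample standard deviation, and the function names `groundedness` and `mean_and_standard_error` are all assumptions made here.

```python
import math

def groundedness(traces, intermediate_answers):
    """Fraction of traces that contain their ground-truth intermediate answer.

    Assumes a case-insensitive substring match; the paper's exact
    matching rule is not specified in the captions.
    """
    hits = sum(
        answer.lower() in trace.lower()
        for trace, answer in zip(traces, intermediate_answers)
    )
    return hits / len(traces)

def mean_and_standard_error(scores):
    """Mean and standard error of per-seed scores (requires at least 2 scores).

    Assumes the sample standard deviation (n - 1 denominator).
    """
    n = len(scores)
    mean = sum(scores) / n
    sample_variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(sample_variance / n)
```

For example, `mean_and_standard_error` would be called with the two final-checkpoint scores of a model on one benchmark, one score per training seed. With only two seeds the error bars are a coarse indicator of variability.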