SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff

TL;DR

SynthWorlds provides a scalable, automatic framework to disentangle reasoning from parametric world knowledge in language models by constructing parallel real-mapped and synthetic-mapped corpora with identical structures. It introduces two reasoning-rich, parallel tasks—multi-hop QA and page navigation—and demonstrates that a persistent knowledge advantage gap remains even with retrieval augmentation and reasoning integration. The framework enables precise measurement of how much parametric knowledge contributes to task performance and offers a controlled environment to compare different knowledge acquisition and integration strategies. Findings highlight opportunities to improve knowledge grounding and reasoning in novel environments, establishing SynthWorlds as a valuable testbed for robust, generalizable LM systems.

Abstract

Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
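The knowledge advantage gap defined above can be illustrated with a short sketch. It assumes a SQuAD-style token-overlap F1 as the per-question score and uses made-up predictions and gold answers; the function names (`token_f1`, `mean_f1`, `knowledge_advantage`) and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (assumed metric)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs):
    """Average F1 over (prediction, gold) pairs."""
    return sum(token_f1(pred, gold) for pred, gold in pairs) / len(pairs)


def knowledge_advantage(rm_pairs, sm_pairs):
    """KA = F1 on the real-mapped task minus F1 on the parallel synth-mapped task."""
    return mean_f1(rm_pairs) - mean_f1(sm_pairs)


# Hypothetical toy data: the same two questions in both worlds.
rm_pairs = [("Alice Example", "Alice Example"), ("Springfield", "Springfield")]
sm_pairs = [("Velmora Tesk", "Velmora Tesk"), ("Quistane", "Ordelia")]

ka = knowledge_advantage(rm_pairs, sm_pairs)
print(f"KA gap: {ka:.2f}")
```

Because the two task sets are parallel, the difference can be read as the contribution of parametric knowledge rather than a difference in reasoning difficulty.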

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Controlled experiments from SynthWorlds corpora. The knowledge advantage gap (KA) is the performance difference between parallel tasks mapped to real-world (RM) and synthetic (SM) entities. Retrieval and page content boost performance, but the gap persists.
  • Figure 2: Overview of SynthWorlds Corpora Construction (Toy Example). A connected subgraph is sampled from a large knowledge base (a). To obscure factual knowledge, entity labels are renamed from real-world labels (real-mapped) to synthetic names (synth-mapped) (b). From synth-mapped triplets, we generate synth-mapped documents. These documents are converted to real-mapped documents through additional LM steps with symbolic references (c). The final output is two parallel corpora: one real-mapped, one synth-mapped. Using the corpora, we construct parallel reasoning tasks (§\ref{sec:task_construction}).
  • Figure 3: Multi-hop QA Construction. Subgraphs matching reasoning motifs are sampled with constraints to ensure uniqueness, diversity, and multi-hop reasoning (a). From their triplet facts, we generate synth-mapped single-hop questions (b), which are composed into a synth-mapped multi-hop question (c). Using the synth-to-real entity mapping, we replace synth names with real names (d). The final output is parallel sets of real-mapped and synth-mapped multi-hop questions.
  • Figure 4: Multi-hop QA Results by Reasoning Motifs. We report F1 scores on SynthWorld-RM (RM) and SynthWorld-SM (SM), along with the knowledge advantage gap ($\mathrm{KA} = \mathrm{F1}_{\mathrm{RM}} - \mathrm{F1}_{\mathrm{SM}}$). Settings: CB = Closed-book, RAG = One-step RAG, CoT+RAG = IRCoT + RAG, RC = Reading Comprehension. We show Recall@5 for RAG baselines (by construction, CB has recall $=0$ and RC has recall $=1$). IRCoT + RAG substantially reduces the KA gap compared to the CB baseline, primarily due to improved retrieval. Example questions for each motif are given in Table \ref{tab:graph_types}.
  • Figure 5: Page Navigation Results by Difficulty (i.e., Expected Random Walk Distance). We report success rate on SynthWorld-RM (RM) and SynthWorld-SM (SM) and the knowledge advantage gap ($\mathrm{KA} = \mathrm{Success}_{\mathrm{RM}} - \mathrm{Success}_{\mathrm{SM}}$). Models consistently perform better on real-mapped corpora, especially in harder navigation tasks, indicating that parametric knowledge enables shortcuts. Page content (Content + Links vs. Links Only) benefits models more on synth-mapped corpora, narrowing the gap and showing its value in novel environments.
  • ...and 3 more figures
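
The renaming step in the Figure 2 caption (real-mapped labels replaced by synthetic names while the graph structure stays fixed) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the triplets, the synthetic names, and the helpers `build_synth_mapping` and `rename_triplets` are all hypothetical, and the paper's actual pipeline may generate and assign names differently.

```python
def build_synth_mapping(triplets, synth_names):
    """Assign one synthetic name to each distinct entity, in first-seen order."""
    entities = list(dict.fromkeys(
        entity for head, _, tail in triplets for entity in (head, tail)
    ))
    if len(synth_names) < len(entities):
        raise ValueError("not enough synthetic names for the sampled entities")
    return dict(zip(entities, synth_names))


def rename_triplets(triplets, mapping):
    """Rewrite entity labels only; relations and graph structure are unchanged."""
    return [(mapping[head], relation, mapping[tail])
            for head, relation, tail in triplets]


# Hypothetical toy subgraph standing in for triplets sampled from a knowledge base.
real_triplets = [
    ("Alice Example", "born_in", "Springfield"),
    ("Springfield", "located_in", "Freedonia"),
]
# Hypothetical synthetic names that carry no real-world meaning.
synth_names = ["Velmora Tesk", "Quistane", "Ordelia"]

real_to_synth = build_synth_mapping(real_triplets, synth_names)
synth_triplets = rename_triplets(real_triplets, real_to_synth)

# Inverting the mapping recovers the real-mapped world, keeping the two parallel.
synth_to_real = {synth: real for real, synth in real_to_synth.items()}
recovered = rename_triplets(synth_triplets, synth_to_real)
```

Keeping the mapping one-to-one is what lets a question written over synth-mapped entities be converted back to its real-mapped twin by name substitution alone.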