$\pi$-CoT: Prolog-Initialized Chain-of-Thought Prompting for Multi-Hop Question-Answering

Chao Wan, Albert Gong, Mihir Mishra, Carl-Leander Henneking, Claas Beger, Kilian Q. Weinberger

TL;DR

Standard Chain-of-Thought prompting is prone to reasoning drift on multi-hop QA in retrieval-augmented setups. π-CoT is a training-free, prompt-based pipeline that translates a complex question into a Prolog query, resolves each sub-query with a SLICE module, and uses the resulting intermediate artifacts (passages, notes, and possibly an answer) to initialize the final CoT step, combining symbolic planning with neural retrieval. Across HotpotQA, 2WikiMultiHopQA, MuSiQue, and PhantomWiki, π-CoT matches or exceeds standard RAG and in-context CoT, with notable gains on harder, multi-branch questions and robustness to long contexts. Because it preserves a structured reasoning trace and manages context step by step, the approach improves reliability and interpretability, suggesting a promising direction for symbolic-neural hybrids in open-domain QA.

Abstract

Chain-of-Thought (CoT) prompting significantly enhances large language models' (LLMs) problem-solving capabilities, but still struggles with complex multi-hop questions, often falling into circular reasoning patterns or deviating from the logical path entirely. This limitation is particularly acute in retrieval-augmented generation (RAG) settings, where obtaining the right context is critical. We introduce Prolog-Initialized Chain-of-Thought ($\pi$-CoT), a novel prompting strategy that combines logic programming's structural rigor with language models' flexibility. $\pi$-CoT reformulates multi-hop questions into Prolog queries decomposed as single-hop sub-queries. These are resolved sequentially, producing intermediate artifacts, with which we initialize the subsequent CoT reasoning procedure. Extensive experiments demonstrate that $\pi$-CoT significantly outperforms standard RAG and in-context CoT on multi-hop question-answering benchmarks.
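
The sequential resolution described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy fact table stands in for retrieval plus LLM verification, and every name here (`FACTS`, `SUB_QUERIES`, `resolve_sequentially`, `build_cot_prompt`) and every fact is hypothetical. It only shows how single-hop sub-queries can be resolved in order, with variable bindings carried forward, and how the collected facts might seed a CoT prompt.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge source: (relation, subject) -> object.
# In a RAG setting this lookup would be replaced by retrieval and an LLM call.
FACTS: Dict[Tuple[str, str], str] = {
    ("mother", "alice"): "carol",
    ("job", "carol"): "engineer",
}

# "What is the job of Alice's mother?" decomposed into single-hop sub-queries
# in a Prolog-like form relation(Subject, Variable); "X" and "Y" are variables.
SUB_QUERIES: List[Tuple[str, str, str]] = [
    ("mother", "alice", "X"),
    ("job", "X", "Y"),
]


def resolve_sequentially(
    sub_queries: List[Tuple[str, str, str]],
    facts: Dict[Tuple[str, str], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Resolve sub-queries in order, threading variable bindings forward."""
    bindings: Dict[str, str] = {}
    notes: List[str] = []
    for relation, subject, variable in sub_queries:
        # Substitute an earlier binding if the subject is a bound variable.
        subject = bindings.get(subject, subject)
        answer = facts.get((relation, subject))
        if answer is None:
            # Resolution failed: stop and keep whatever was gathered so far.
            break
        bindings[variable] = answer
        notes.append(f"{relation}({subject!r}, {answer!r}).")
    return bindings, notes


def build_cot_prompt(question: str, notes: List[str]) -> str:
    """Initialize a CoT prompt with the facts collected during resolution."""
    lines = ["Known facts:", *notes, f"Question: {question}",
             "Let's think step by step."]
    return "\n".join(lines)


if __name__ == "__main__":
    _, collected = resolve_sequentially(SUB_QUERIES, FACTS)
    print(build_cot_prompt("What is the job of Alice's mother?", collected))
```

Stopping at the first unresolved sub-query while keeping the earlier notes mirrors the idea that partial intermediate artifacts can still initialize the final reasoning step even when the symbolic execution yields no answer.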

Paper Structure

This paper contains 43 sections, 4 equations, 4 figures, 23 tables.

Figures (4)

  • Figure 1: Overview of $\pi$-CoT. Left: $\pi$-CoT executes an LLM-generated Prolog query, using the SLICE module to resolve each sub-query $q_t$. Right: $\pi$-CoT uses the passages, notes, and (potentially) answer from the SLICE modules to initialize the CoT prompt for the final LLM call.
  • Figure 2: SLICE module for fact verification in the RAG setting. At step $t=2$, the module takes in the previous state $S_1$ containing variable assignments, the current sub-query $q_2$, and the corpus $\mathcal{C}$ (not shown) as inputs and outputs $S_2$. Only the Prolog fact (in green) corresponding to a valid statement is added to a growing knowledge base.
  • Figure 3: Breakdown of $\pi$-CoT predictions by Prolog execution outcome. Each pie chart shows the distribution of predictions from \ref{tab:fullwiki} across four categories based on whether Prolog returns an answer ($S_T \neq \emptyset$ vs. $S_T = \emptyset$) and whether the final answer is correct (green) or incorrect (red). Lighter shades indicate cases where Prolog returns no answer ($S_T = \emptyset$), while solid colors indicate successful Prolog execution ($S_T \neq \emptyset$).
  • Figure 4: F1 score vs. difficulty, as measured by number of reasoning steps. We use the synthetic PW-S benchmark from Gong et al. (2025) and display mean ± 1 standard error. For each model, we evaluate CoT and $\pi$-CoT prompting.
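
The four-way breakdown described in the Figure 3 caption can be sketched as a simple tally. The snippet below is an illustration only: the function names, category labels, and input records are hypothetical and made up for the example, not taken from the paper's results. Each record pairs "did Prolog return an answer" ($S_T \neq \emptyset$) with "was the final answer correct".

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def outcome_category(prolog_returned_answer: bool, final_correct: bool) -> str:
    """Map one prediction to one of the four categories."""
    execution = "prolog_answer" if prolog_returned_answer else "prolog_empty"
    correctness = "correct" if final_correct else "incorrect"
    return f"{execution}/{correctness}"


def breakdown(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the fraction of predictions falling in each category."""
    counts = Counter(outcome_category(p, c) for p, c in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    # Made-up records: (Prolog returned an answer, final answer correct).
    example = [(True, True), (True, True), (True, False), (False, True)]
    for category, share in sorted(breakdown(example).items()):
        print(f"{category}: {share:.2f}")
```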

Theorems & Definitions (1)

  • Example 1