Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge

Haotong Zhang

TL;DR

The paper addresses the persistent challenges of large language models in multi-hop reasoning that relies on external knowledge. It evaluates GPT-3.5-family (text-davinci-002) models with Chain-of-Thought prompting across four benchmarks (HotpotQA, EntailmentBank, QASC, bAbI15), systematically probing internal versus external knowledge, non-sequential reasoning, and generalisation to higher hop counts. Key findings show that external knowledge can boost performance but is sensitive to distractors and prompt-test consistency, that non-sequential reasoning remains poorly supported by current CoT methods, and that generalisation to more hops is limited, with notable decomposition and reasoning-path errors. The results highlight the need for improved retrieval-augmented reasoning and more robust problem decomposition to approach human-like multi-hop reasoning in real-world tasks.

Abstract

We carry out a series of experiments to test large language models' multi-hop reasoning ability from three aspects: selecting and combining external knowledge, dealing with non-sequential reasoning tasks, and generalising to data samples with larger numbers of hops. We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations). Our results reveal that, despite the impressive performance achieved by large language models on various reasoning tasks, the models still suffer from severe drawbacks, which show a large gap with humans.

Paper Structure

This paper contains 18 sections, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Two examples on HotpotQA and EntailmentBank respectively where providing external knowledge helps the model answer the question correctly. Best viewed in colour.
  • Figure 2: Distribution of proof lengths of generated reasoning paths in different settings. X-axis indicates the proof length. Y-axis indicates the different settings.