Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge
Haotong Zhang
TL;DR
The paper addresses the persistent challenges large language models face in multi-hop reasoning that relies on external knowledge. It evaluates GPT-3.5-family (text-davinci-002) models with Chain-of-Thought (CoT) prompting across four benchmarks (HotpotQA, EntailmentBank, QASC, bAbI15), systematically probing internal versus external knowledge, non-sequential reasoning, and generalisation to higher hop counts. Key findings are that external knowledge can boost performance but is sensitive to distractors and to consistency between prompt and test examples, that non-sequential reasoning remains poorly supported by current CoT methods, and that generalisation to more hops is limited, with notable decomposition and reasoning-path errors. The results highlight the need for improved retrieval-augmented reasoning and more robust problem decomposition to approach human-like multi-hop reasoning in real-world tasks.
Abstract
We carry out a series of experiments to test large language models' multi-hop reasoning ability from three aspects: selecting and combining external knowledge, dealing with non-sequential reasoning tasks, and generalising to data samples with larger numbers of hops. We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations). Our results reveal that, despite the impressive performance large language models achieve on various reasoning tasks, they still suffer from severe drawbacks that reveal a large gap with human performance.
