Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?

Bangzheng Li; Ben Zhou; Fei Wang; Xingyu Fu; Dan Roth; Muhao Chen

TL;DR

The paper introduces EureQA, a QA benchmark designed to probe whether LLMs follow extended reasoning chains or rely on deceptive semantic shortcuts. By constructing multi-hop chains from DBpedia and masking entities with hypernyms to remove surface cues, EureQA forces models to execute explicit reasoning steps. Across ChatGPT, Gemini-Pro, and GPT-4, results show substantial reliance on semantic associations: performance drops as reasoning depth grows, and humans outperform the models by wide margins on hard instances. The work highlights a fundamental limitation in current LLM reasoning and provides a data-generation framework and evaluation protocol to drive future improvements in faithful, chain-based inference.

Abstract

Despite recent advances in large language models (LLMs) and their high performance across numerous benchmarks, recent research has revealed that LLMs suffer from hallucinations and unfaithful reasoning. This work studies a specific type of hallucination induced by semantic associations. Specifically, we investigate to what extent LLMs take shortcuts from certain keyword/entity biases in the prompt instead of following the correct reasoning path. To quantify this phenomenon, we propose a novel probing method and benchmark called EureQA. We start from questions that LLMs answer correctly with utmost certainty, and recursively mask the important entity with an evidence sentence, asking models to find the masked entities according to a chain of evidence before answering the question. During the construction of the evidence, we purposefully replace semantic clues (entities) that may lead to the correct answer with distractor clues (evidence) that do not directly lead to the correct answer but require a chain-like reasoning process. We evaluate whether models can follow the correct reasoning chain instead of short-cutting through distractor clues. We find that existing LLMs lack the necessary capabilities to follow correct reasoning paths and resist the temptation of greedy shortcuts. We show that the distractor semantic associations often lead to model hallucination, which is strong evidence that calls into question the validity of current LLM reasoning.
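
The recursive masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `mask_chain`, the `[E1]`-style placeholders, and the toy facts are all assumptions made for illustration, and the sketch leaves out the DBpedia source, the hypernym masking, and the distractor selection mentioned above.

```python
def mask_chain(question, hops, depth):
    """Recursively hide entities behind placeholders (illustrative only).

    question: seed question that mentions the first entity by name.
    hops: list of (entity, template) pairs. Each template contains "{ph}",
          which stands for that entity's placeholder, and mentions the next
          hop's entity by name so it can be masked in turn.
    depth: number of layers to mask.
    Returns (masked_question, evidence_sentences).
    """
    if not 0 <= depth <= len(hops):
        raise ValueError("depth must be between 0 and len(hops)")
    # lines[0] is the question; every later item is an evidence sentence.
    lines = [question]
    for i, (entity, template) in enumerate(hops[:depth], start=1):
        placeholder = f"[E{i}]"  # hypothetical placeholder format
        if entity not in lines[-1]:
            raise ValueError(f"{entity!r} does not appear in the previous layer")
        # Hide the entity in the most recent layer...
        lines[-1] = lines[-1].replace(entity, placeholder)
        # ...and add an evidence sentence that identifies it.
        lines.append(template.format(ph=placeholder))
    return lines[0], lines[1:]


# Toy chain; the facts are chosen for illustration and are not from EureQA.
toy_hops = [
    ("Paris", "{ph} is the city where the Eiffel Tower stands."),
    ("Eiffel Tower", "{ph} is the tower named after Gustave Eiffel."),
]
masked_question, evidence = mask_chain(
    "In which country is Paris located?", toy_hops, depth=2
)
print(masked_question)
for sentence in evidence:
    print(sentence)
```

With `depth=2`, a solver has to resolve `[E2]` first, then `[E1]`, before the question itself can be answered, which is the chain-following behavior the benchmark is meant to test. Increasing `depth` lengthens the chain without changing the final answer.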

Paper Structure (20 sections, 10 figures, 3 tables)

Introduction
EureQA
Reasoning Chain
Question Generation
Question Refinement
Data Statistics
Experiment Setup
Model Configuration
Prompting Methods
Evaluation Protocol
Results
Analysis and Discussions
Can human solve EureQA?
Do LLMs take Shortcuts?
Do open source LLMs perform better?
...and 5 more sections

Figures (10)

Figure 1: Even when given all the information required to answer the question (selected information shown in the figure), GPT-4 still makes mistakes, starting from the early layers (highlighted in grey). Only partial output is shown here. Note that GPT-4 is given few-shot prompts.
Figure 2: The data generation process of EureQA.
Figure 3: Categorical distribution of seed entities in questions of EureQA.
Figure 4: The correlation between GPT-4 performance on EureQA hard set and entity similarities.
Figure 5: The distribution of entity similarity scores.
...and 5 more figures
