Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

Xing Zi; Xinying Zhou; Jinghao Xiao; Catarina Moreira; Mukesh Prasad

Abstract

While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/

Paper Structure (35 sections, 13 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Foundational Medical QA Benchmarks
Cross-Lingual and Global Health Generalization
Knowledge Graph Integration and Synthetic Benchmark Construction
Methodology
Preliminary: The Pitfall of Shortcut Learning
Phase I: Topology-Regularized KG Construction
Phase II: Synthesizing Constrained Diagnostics
Dataset Statistics and Task Distribution
Dataset Quality and Lexical Analysis
Expert Human Validation
Limitations
Comparison with Existing Benchmarks
Experiments and Evaluation Metrics
...and 20 more sections

Figures (7)

Figure 1: The end-to-end methodological pipeline of ShatterMed-QA.
Figure 2: Distribution of clinical tasks within the ShatterMed-QA benchmark. (a) illustrates the macroscopic task proportions heavily anchored in clinical diagnosis, while (b) demonstrates the structural consistency across different languages (ZH/EN) and difficulty tiers (Easy/Hard).
Figure 3: Bilingual ablation study of frequency threshold $k$ on Knowledge Graph topology. The dual-axis plots contrast inferential complexity (ASP, left axis) with structural connectivity (Largest Component Size, right axis).
Figure 4: The prompt template used for automated taxonomy classification and dataset quality scoring.
Figure 5: Comparative analysis of model behaviors under the topological stress of ShatterMed-QA.
...and 2 more figures
