Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models

Yu Yuan, Lili Zhao, Kai Zhang, Guangting Zheng, Qi Liu

TL;DR

The paper tackles shortcut learning in LLMs by introducing Shortcut Suite, a benchmark with six shortcut types, five metrics, and four prompting strategies that quantifies how spurious cues degrade robustness and generalization. Evaluating a range of closed- and open-source LLMs on natural language inference (NLI), and extending the evaluation to sentiment analysis (SA) and paraphrase identification (PI), it shows that larger models are more prone to shortcut reliance in zero-shot and few-shot settings, while Chain-of-Thought (CoT) prompting mitigates this tendency and improves reasoning. The work also introduces explanation-focused measures (SFS, ICS, EQS) and analyzes confidence, prediction distributions, and error types (distraction, disguised comprehension, logical fallacy) to characterize why models rely on shortcuts. The findings point to CoT prompting as a promising mitigation path and provide a foundation for future methods, such as fine-tuning on unbiased data and retrieval-augmented reasoning, to improve robustness and generalization in LLMs.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in various natural language processing tasks. However, LLMs may rely on dataset biases as shortcuts for prediction, which can significantly impair their robustness and generalization capabilities. This paper presents Shortcut Suite, a comprehensive test suite designed to evaluate the impact of shortcuts on LLMs' performance, incorporating six shortcut types, five evaluation metrics, and four prompting strategies. Our extensive experiments yield several key findings: 1) LLMs demonstrate varying reliance on shortcuts for downstream tasks, significantly impairing their performance. 2) Larger LLMs are more likely to utilize shortcuts under zero-shot and few-shot in-context learning prompts. 3) Chain-of-thought prompting notably reduces shortcut reliance and outperforms other prompting strategies, while few-shot prompts generally underperform compared to zero-shot prompts. 4) LLMs often exhibit overconfidence in their predictions, especially when dealing with datasets that contain shortcuts. 5) LLMs generally have a lower explanation quality in shortcut-laden datasets, with errors falling into three types: distraction, disguised comprehension, and logical fallacy. Our findings offer new insights for evaluating robustness and generalization in LLMs and suggest potential directions for mitigating the reliance on shortcuts. The code is available at https://github.com/yyhappier/ShortcutSuite.git.
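
To make the notion of a shortcut concrete, the sketch below implements one heuristic of the kind the abstract refers to: predicting entailment whenever every word of the hypothesis appears, in order, in the premise. This is an illustrative toy and not code from the Shortcut Suite repository; the function names, the whitespace tokenization, and the two-way label set are assumptions made for the example.

```python
def is_subsequence(hypothesis_tokens, premise_tokens):
    """Return True if hypothesis_tokens appear in premise_tokens in order
    (not necessarily contiguously)."""
    remaining = iter(premise_tokens)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in hypothesis_tokens)


def shortcut_predict(premise, hypothesis):
    """Hypothetical shortcut classifier: ignores meaning and predicts
    'entailment' purely from surface overlap."""
    premise_tokens = premise.lower().split()
    hypothesis_tokens = hypothesis.lower().split()
    if is_subsequence(hypothesis_tokens, premise_tokens):
        return "entailment"
    return "non-entailment"


if __name__ == "__main__":
    # The premise does not say the actor danced, yet the heuristic
    # predicts entailment because the words match in order.
    print(shortcut_predict("The doctor near the actor danced",
                           "The actor danced"))
```

A model that behaves like `shortcut_predict` can score well on data where overlap and entailment happen to coincide, and then fail on examples constructed to break that correlation, which is the failure mode the test suite is designed to expose.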

Paper Structure

This paper contains 35 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Shortcut Learning Behavior: The LLM mistakenly infers the premise entails the hypothesis if all subsequences match, skipping deep semantic analysis.
  • Figure 2: Box plots of confidence scores across all datasets under zero-shot CoT prompting (each LLM is denoted by an abbreviation). LLMs generally report confidence scores that significantly exceed their actual accuracy.
  • Figure 3: Label distribution percentages (%) for each LLM’s predictions under zero-shot prompting (each LLM is abbreviated). Distributions for the other three datasets are in Appendix A.
  • Figure 4: An illustrative example of distraction in LLMs: in the Position dataset, the LLM is observed to be distracted by tautologies, thus ignoring useful information.
  • Figure 5: Label distribution as percentages (%) for LLMs' prediction under zero-shot prompting (each LLM is denoted by an abbreviation).
  • ...and 3 more figures