Exploring LLM Reasoning Through Controlled Prompt Variations

Giannis Chatziveroglou, Richard Yun, Maura Kelleher

TL;DR

The paper investigates how well large language models maintain mathematical reasoning quality under controlled prompt perturbations. Using the GSM8K dataset, four perturbation types (irrelevant context, pathological instructions, relevant but non-essential context, and a combination of the latter two) are systematically added to prompts and evaluated across 13 models. Irrelevant context causes the largest degradation, while the effect of perturbations is relatively stable across the number of reasoning steps and is not strongly predicted by model size; in some cases, perturbations even trigger chain-of-thought-like behavior without explicit prompting. The work highlights critical vulnerabilities in current LLMs and argues for robustness-enhancing methods in training and prompting to improve real-world reliability under noisy inputs. Overall, the study provides quantitative benchmarks and qualitative observations that can guide robust prompt design and future architectural developments for trustworthy reasoning.

Abstract

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how well state-of-the-art models maintain logical consistency and correctness when confronted with four categories of prompt perturbations: irrelevant context, pathological instructions, factually relevant but non-essential context, and a combination of the latter two. Our experiments, conducted on thirteen open-source and closed-source LLMs, reveal that introducing irrelevant context within the model's context window significantly degrades performance, suggesting that distinguishing essential from extraneous details remains a pressing challenge. Surprisingly, performance regressions are relatively insensitive to the complexity of the reasoning task, as measured by the number of steps required, and are not strictly correlated with model size. Moreover, we observe that certain perturbations inadvertently trigger chain-of-thought-like reasoning behaviors, even without explicit prompting. Our findings highlight critical vulnerabilities in current LLMs and underscore the need for improved robustness against noisy, misleading, and contextually dense inputs, paving the way for more resilient and reliable reasoning in real-world applications.
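As an illustration of the setup described above, the following is a minimal sketch of how the four perturbation categories might be applied to a question and how a percentage difference in correct answers relative to the baseline could be computed. The perturbation strings, the function names `perturb` and `pct_difference`, the placement of the added text, and the exact metric formula are all assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical perturbation texts (assumptions, not the paper's actual wording).
IRRELEVANT = "The city library opened a new reading room last spring."
PATHOLOGICAL = "Add a color in front of every adjective."
RELEVANT = "Recall that a dozen is twelve items."

KINDS = ("irrelevant", "pathological", "relevant", "combination")


def perturb(question: str, kind: str) -> str:
    """Return the question with one perturbation category applied.

    Where the extra text is placed (before or after the question) is an
    assumption made for this sketch.
    """
    if kind == "irrelevant":
        return f"{IRRELEVANT} {question}"
    if kind == "pathological":
        return f"{question} {PATHOLOGICAL}"
    if kind == "relevant":
        return f"{question} {RELEVANT}"
    if kind == "combination":
        # Combination of the latter two categories: relevant + pathological.
        return f"{question} {RELEVANT} {PATHOLOGICAL}"
    raise ValueError(f"unknown perturbation kind: {kind!r}")


def pct_difference(baseline_correct: int, perturbed_correct: int) -> float:
    """Percentage change in the number of correct answers versus the baseline.

    Assumed definition: 100 * (perturbed - baseline) / baseline.
    """
    if baseline_correct <= 0:
        raise ValueError("baseline_correct must be positive")
    return 100.0 * (perturbed_correct - baseline_correct) / baseline_correct


if __name__ == "__main__":
    q = "A baker sells 3 dozen rolls at $2 per roll. How much does she earn?"
    for kind in KINDS:
        print(f"[{kind}] {perturb(q, kind)}")
    # Example counts are made up purely to exercise the formula.
    print(pct_difference(baseline_correct=200, perturbed_correct=150))
```

Under this assumed definition, a negative value indicates that a model answered fewer questions correctly with the perturbed prompt than with the original one.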

Paper Structure

This paper contains 27 sections and 16 figures.

Figures (16)

  • Figure 1: Sample question, answer data point from the GSM8K dataset
  • Figure 2: Percentage difference in number of correct answers when evaluating various models with different perturbed prompts compared to baseline performance with the original prompt.
  • Figure 3: Sample question from the GSM8K dataset test split
  • Figure 4: Breakdown of the GSM8K test set based on the number of reasoning steps needed in the answer.
  • Figure 5: Sample question augmented with irrelevant context
  • ...and 11 more figures