Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training

Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao

TL;DR

This work tackles the brittleness of LLM reasoning to surface-form variations by introducing symmetry-aware data augmentation, MEND, which augments post-training data with permutation and redundancy transformations to enforce invariant knowledge extraction. By formalizing reasoning on a DAG and defining reasoning consistency as stability across semantically equivalent queries, the method shows improved data efficiency and stronger OOD generalization across logical and arithmetic tasks. Empirical results demonstrate that MEND outperforms reasoning-chain augmentation baselines and inference-time paraphrasing baselines, while a probing tool confirms enhanced in-context knowledge extraction. The findings suggest that structured dataset curation focused on query symmetry can meaningfully boost LLM robustness in reasoning tasks, with implications for more reliable deployment in diverse prompt settings.
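The notion of reasoning consistency mentioned above can be made concrete with a small measurement sketch. The code below is illustrative only and is not the paper's formal definition: `answer_fn` is a hypothetical stand-in for any model call that maps a query string to a final answer, and both function names are assumptions introduced here.

```python
from typing import Callable, Sequence


def consistency_rate(
    answer_fn: Callable[[str], str],
    variants: Sequence[str],
    reference: str,
) -> float:
    """Fraction of semantically equivalent query variants answered correctly.

    `answer_fn` is a hypothetical placeholder for a model call; it is not
    an API from the paper.
    """
    if not variants:
        raise ValueError("need at least one query variant")
    correct = sum(answer_fn(q).strip() == reference.strip() for q in variants)
    return correct / len(variants)


def is_consistent(answer_fn: Callable[[str], str], variants: Sequence[str]) -> bool:
    """True when every variant yields the same final answer, right or wrong."""
    if not variants:
        raise ValueError("need at least one query variant")
    answers = {answer_fn(q).strip() for q in variants}
    return len(answers) == 1
```

Under this sketch, a model that is brittle to surface form would show a consistency rate below 1 on a set of variants that all share the same reference answer.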

Abstract

Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs' awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentations, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insight into improving LLM robustness through structured dataset curation.
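As a rough illustration of the query augmentations described above, the following sketch builds variants of a query by permuting its premises and inserting redundant sentences. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the sentence-level split into premises, and the pool of distractor sentences are all hypothetical, and the premises are assumed to be order-independent so that shuffling preserves the answer.

```python
import random
from typing import List, Sequence


def add_redundancy(
    premises: Sequence[str],
    distractors: Sequence[str],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Insert k distinct distractor sentences at random positions."""
    augmented = list(premises)
    # rng.sample raises ValueError if k exceeds the number of distractors.
    for sentence in rng.sample(list(distractors), k):
        # randint is inclusive on both ends, so inserting at the end is allowed.
        augmented.insert(rng.randint(0, len(augmented)), sentence)
    return augmented


def make_variants(
    premises: Sequence[str],
    distractors: Sequence[str],
    question: str,
    n: int,
    k: int,
    seed: int = 0,
) -> List[str]:
    """Build n query variants: shuffled premises plus k redundant sentences.

    Assumes the premises can be reordered without changing the answer
    (hypothetical simplification for this sketch).
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        shuffled = list(premises)
        rng.shuffle(shuffled)
        body = add_redundancy(shuffled, distractors, k, rng)
        variants.append(" ".join(body + [question]))
    return variants
```

Each variant keeps the original answer as its label, so the augmented set can be added to the post-training data without re-annotating reasoning chains.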

Paper Structure

This paper contains 28 sections, 7 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Failure examples of LLMs under surface form variations. Queries are modified from R-GSM chen2024premise. Table: correctness over $10$ evaluations across different LLMs, with each entry marked as all correct, all wrong, or error occurs. Full incorrect answers are provided in the appendix subsection on closed-source model evaluation.
  • Figure 2: Overview of Symmetry-Enhanced Data Augmentation and its Comparison with Reasoning Chain Data Augmentation.
  • Figure 3: Accuracy evaluation of the DeepSeek-math-7B-base model on the arithmetic reasoning task with different surface forms. Left: results for different DAG depths; Right: results for different amounts of added redundant information.
  • Figure 4: Evaluations with respect to different query variations. Each panel corresponds to one permutation order type; the x-axis shows the number of redundancies in the test set, and the y-axis shows the accuracy of final answers. For each dataset, accuracy is reported over $200$ examples.
  • Figure 5: Data efficiency evaluation. R in the panel titles indicates the number of redundancies in the query. A dataset size of $1$ indicates that the original dataset is used for SFT. All plots are averaged over $3$ random seeds with temperature$=1$. The solid line is the mean, and the light shaded band represents one standard deviation.
  • ...and 6 more figures