Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models
Yue Zhou, Yada Zhu, Diego Antognini, Yoon Kim, Yang Zhang
TL;DR
The paper shows that the surface form of a math problem can dramatically change whether an LLM solves it, even when the semantics are unchanged. It introduces Self-Consistency-over-Paraphrases (SCoP), which generates multiple paraphrased surface forms of a problem and aggregates reasoning paths across them to improve accuracy, especially for problems initially deemed unsolvable. Across GSM8K, AQuA, MATH, and MMLU with LLaMA-2-70b, GPT-3.5-Turbo, and GPT-4, SCoP consistently outperforms vanilla Self-Consistency. The analysis also finds notable cross-model agreement on problem difficulty and quantifies robustness gaps with the proposed Variance of Variations (VOV). The work provides practical prompts and exemplar-search strategies, analyzes the transferability of paraphrases across models, and suggests directions for making mathematical reasoning in LLMs more robust to surface form.
Abstract
This paper studies the relationship between the surface form of a mathematical problem and its solvability by large language models. We find that subtle alterations in the surface form can significantly impact the answer distribution and the solve rate, exposing the language model's sensitivity to surface form and its lack of robustness when reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem. We evaluate our approach on four mathematics reasoning benchmarks over three large language models and show that SCoP improves mathematical reasoning performance over vanilla self-consistency, particularly for problems initially deemed unsolvable. Finally, we provide additional experiments and discussion regarding problem difficulty and surface forms, including cross-model difficulty agreement, paraphrasing transferability, and Variance of Variations (VOV) for language model evaluation.
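The idea behind SCoP, as described above, can be illustrated with a minimal sketch. This is not the authors' implementation: the `paraphrase` and `sample_answer` callables are hypothetical stand-ins for LLM calls, the even split of the sampling budget across paraphrases is an assumption, and a simple majority vote is assumed as the aggregation rule.

```python
from collections import Counter
from typing import Callable, List


def self_consistency_answer(
    problem: str,
    sample_answer: Callable[[str], str],
    n_paths: int = 8,
) -> str:
    """Vanilla self-consistency: sample n_paths answers from the original
    surface form and return the most frequent one."""
    answers = [sample_answer(problem) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]


def scop_answer(
    problem: str,
    paraphrase: Callable[[str, int], List[str]],
    sample_answer: Callable[[str], str],
    k_paraphrases: int = 4,
    n_paths: int = 8,
) -> str:
    """Sketch of self-consistency over paraphrases.

    paraphrase(problem, k) is a hypothetical function returning up to k
    semantically equivalent rewordings; sample_answer(text) is a hypothetical
    function returning the final answer of one sampled reasoning path.
    """
    surface_forms = paraphrase(problem, k_paraphrases)
    if not surface_forms:
        # Fall back to the original wording if no paraphrase is produced.
        surface_forms = [problem]

    # Assumption: the sampling budget is split evenly across surface forms.
    per_form = max(1, n_paths // len(surface_forms))

    # Pool the answers from all surface forms, then take a majority vote.
    answers = []
    for form in surface_forms:
        for _ in range(per_form):
            answers.append(sample_answer(form))
    return Counter(answers).most_common(1)[0][0]
```

With the same total budget, the only difference from the vanilla baseline in this sketch is that the sampled reasoning paths are spread over several wordings of the problem rather than drawn from a single one.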
