Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models

Yue Zhou, Yada Zhu, Diego Antognini, Yoon Kim, Yang Zhang

TL;DR

The paper shows that the surface form of a math problem can dramatically change whether an LLM solves it, even when the semantics are unchanged. It introduces Self-Consistency-over-Paraphrases (SCoP), which generates multiple paraphrased surface forms of a problem and aggregates reasoning paths across them to improve accuracy, especially for problems initially deemed unsolvable. Across GSM8K, AQuA, MATH, and MMLU with LLaMA-2-70b, GPT-3.5-Turbo, and GPT-4, SCoP consistently outperforms vanilla Self-Consistency. Further analysis examines cross-model agreement on problem difficulty and the transferability of paraphrases across models, and proposes Variance of Variations (VOV) as a measure of robustness to surface form. The work also provides practical prompts and exemplar-search strategies and suggests directions for building more surface-form-robust mathematical reasoning in LLMs.

Abstract

This paper studies the relationship between the surface form of a mathematical problem and its solvability by large language models. We find that subtle alterations in the surface form can significantly impact the answer distribution and the solve rate, exposing the language model's lack of robustness and its sensitivity to the surface form in reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem. We evaluate our approach on four mathematics reasoning benchmarks over three large language models and show that SCoP improves mathematical reasoning performance over vanilla self-consistency, particularly for problems initially deemed unsolvable. Finally, we provide additional experiments and discussion regarding problem difficulty and surface forms, including cross-model difficulty agreement, paraphrasing transferability, and Variance of Variations (VOV) for language model evaluation.

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the answer distribution and solve rate across surface-form variations of a math word problem from GSM8K, obtained by prompting GPT-3.5-turbo with Self-Consistency and 40 sampled reasoning paths. Solve rate can vary dramatically between surface forms with equivalent semantics.
  • Figure 2: Change in solve rate on GSM8K when moving from the original problem to one of its random naive paraphrases.
  • Figure 3: A comparison between Self-Consistency and our SCoP. SCoP splits $N$ reasoning paths over $K$ in-context learned paraphrases, instead of devoting all $N$ reasoning paths to the single original problem $P$. The final answer is selected by aggregating all reasoning paths from these paraphrases with a majority vote.
  • Figure 4: (a) GSM8K (b) AQuA (c) MATH (d) MMLU. The average number of data points in the training set needed for obtaining $N_{shot}$ exemplars at different margins.
  • Figure 5: Data Difficulty Map for GSM8K using GPT-3.5, with three types of changes from solving the original problem to one of its random paraphrases: (a) Improvement, (b) Overconfidence, and (c) Uncertainty. Arrows indicate the solve rate and entropy change from solving the original problem to its paraphrased version.
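
The captions for Figures 1, 3, and 5 refer to a few quantities that are easy to make concrete: the solve rate, the entropy of the answer distribution, and the way SCoP spreads $N$ reasoning paths over $K$ paraphrases before a majority vote. The following is a minimal Python sketch of those ideas, not the authors' implementation. Several pieces are assumptions made for illustration: `sample_answer` is a hypothetical callable standing in for one sampled LLM reasoning path, the paraphrases are assumed to be given already, the $N$ paths are split as evenly as possible, and entropy is measured in bits.

```python
from collections import Counter
from math import log2
from typing import Callable, List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def solve_rate(answers: Sequence[str], gold: str) -> float:
    """Fraction of sampled reasoning paths whose final answer equals the gold answer."""
    return sum(a == gold for a in answers) / len(answers)


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in bits, an assumed unit) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * log2(c / n) for c in Counter(answers).values())


def self_consistency(
    problem: str, sample_answer: Callable[[str], str], n_paths: int
) -> List[str]:
    """Vanilla Self-Consistency: all N paths are sampled from the original problem."""
    return [sample_answer(problem) for _ in range(n_paths)]


def scop(
    paraphrases: Sequence[str], sample_answer: Callable[[str], str], n_paths: int
) -> List[str]:
    """SCoP-style sampling: N paths are divided over K paraphrases.

    Assumption: the split is as even as possible, with the first
    (N mod K) paraphrases receiving one extra path.
    """
    k = len(paraphrases)
    answers: List[str] = []
    for i, paraphrase in enumerate(paraphrases):
        share = n_paths // k + (1 if i < n_paths % k else 0)
        answers.extend(sample_answer(paraphrase) for _ in range(share))
    return answers
```

In both settings the final prediction is `majority_vote(answers)`; the only difference is which surface forms the answers were sampled from. With a real model, `sample_answer` would issue one sampled chain-of-thought completion and extract its final answer.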