How often are errors in natural language reasoning due to paraphrastic variability?

Neha Srikanth, Marine Carpuat, Rachel Rudinger

TL;DR

This work examines how paraphrastic variability affects measurements of natural language reasoning. It introduces paraphrastic consistency, $P_C$, the probability that a model is equally correct (or equally incorrect) on two paraphrases of the same problem. Writing $\theta$ for a model's probability of answering a paraphrase of a given problem correctly, $P_C$ is linked to variance components via $P_C = \mathbb{E}[\theta^2] + \mathbb{E}[(1-\theta)^2]$, or equivalently $P_C = 1 - 2\,\mathbb{E}[\theta(1-\theta)]$, with the related metric $\mathrm{PVAP}$ (the proportion of variance attributable to paraphrasing). To study this, the authors construct ParaNLU, a dataset of 7,782 human-written and 7,295 model-generated paraphrases across 1,000 reasoning problems spanning defeasible and abductive NLI, and validate paraphrase labels to ensure semantic equivalence. They show that paraphrastic consistency improves with pretraining but not with finetuning, and that no model achieves both high accuracy and high $P_C$, underscoring the need to evaluate reasoning ability alongside linguistic robustness. The work suggests reporting $P_C$ as a diagnostic alongside accuracy to better understand model reasoning and to inform deployment decisions in applications where paraphrase robustness matters.
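The two expressions for $P_C$ above are the same quantity, and the link to variance follows from the law of total variance. The short derivation below is a sketch under stated assumptions: $C \in \{0, 1\}$ (correctness on a randomly drawn paraphrase of a randomly drawn problem) and $A = \mathbb{E}[\theta]$ (overall accuracy) are symbols introduced here for illustration, and the final line assumes $\mathrm{PVAP}$ is the within-problem share of the total variance in correctness.

$$
\begin{aligned}
\theta^2 + (1-\theta)^2 &= 1 - 2\theta + 2\theta^2 = 1 - 2\,\theta(1-\theta) \\
\Rightarrow\quad P_C &= \mathbb{E}[\theta^2] + \mathbb{E}[(1-\theta)^2] = 1 - 2\,\mathbb{E}[\theta(1-\theta)] \\
\mathrm{Var}(C) = A(1-A) &= \underbrace{\mathbb{E}[\theta(1-\theta)]}_{\text{within problems (paraphrasing)}} + \underbrace{\mathrm{Var}(\theta)}_{\text{across problems}} \\
\mathrm{PVAP} &= \frac{\mathbb{E}[\theta(1-\theta)]}{A(1-A)} = \frac{1 - P_C}{2\,A(1-A)}
\end{aligned}
$$

Under this reading, $P_C = 1$ means none of the variance in correctness comes from paraphrasing, and lower $P_C$ at a fixed accuracy means a larger share does.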

Abstract

Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model's variance in correctness attributable to paraphrasing. To estimate paraphrastic consistency, we collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNLU, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not finetuning. All models tested exhibited room for improvement in paraphrastic consistency.
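As a minimal sketch of how such a metric could be estimated, the Python below computes a plug-in estimate of paraphrastic consistency from per-problem correctness records. Several things here are assumptions rather than the authors' implementation: the function names, the "bucket" data layout (one list of 0/1 outcomes per original problem), the plug-in estimator (which treats the two paraphrases as drawn with replacement), and the PVAP formula, taken as within-problem variance divided by total variance $A(1-A)$ for accuracy $A$. The two scenarios are made-up illustrations, not ParaNLU data.

```python
from typing import Sequence


def accuracy(buckets: Sequence[Sequence[int]]) -> float:
    """Mean per-problem accuracy (equals overall accuracy for equal-size buckets)."""
    thetas = [sum(outcomes) / len(outcomes) for outcomes in buckets]
    return sum(thetas) / len(thetas)


def paraphrastic_consistency(buckets: Sequence[Sequence[int]]) -> float:
    """Plug-in estimate of P_C: average of theta^2 + (1 - theta)^2 over problems.

    theta is estimated as the fraction of a problem's paraphrases answered
    correctly (an assumption of this sketch, not necessarily the paper's estimator).
    """
    total = 0.0
    for outcomes in buckets:
        theta = sum(outcomes) / len(outcomes)
        total += theta ** 2 + (1.0 - theta) ** 2
    return total / len(buckets)


def pvap(buckets: Sequence[Sequence[int]]) -> float:
    """Share of variance in correctness attributable to paraphrasing (assumed form)."""
    acc = accuracy(buckets)
    total_variance = acc * (1.0 - acc)
    if total_variance == 0.0:
        return 0.0  # model is always right or always wrong: nothing to attribute
    within_variance = (1.0 - paraphrastic_consistency(buckets)) / 2.0
    return within_variance / total_variance


# Hypothetical scenarios: 5 problems x 5 paraphrases, both at 80% accuracy.
# All errors concentrated in a single problem.
concentrated = [[1, 1, 1, 1, 1]] * 4 + [[0, 0, 0, 0, 0]]
# One error spread into every problem.
spread = [[1, 1, 1, 1, 0]] * 5

for name, data in [("concentrated", concentrated), ("spread", spread)]:
    print(name, round(accuracy(data), 3),
          round(paraphrastic_consistency(data), 3), round(pvap(data), 3))
```

Both scenarios have the same accuracy, yet the first is fully consistent (consistency 1, PVAP 0) while the second attributes all of its variance to paraphrasing (consistency about 0.68, PVAP 1), which is why accuracy alone cannot separate the two.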

Paper Structure (47 sections, 7 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: $\delta$-NLI instance with a set of paraphrased update sentences. We study paraphrastic consistency, or the probability that a model's prediction for two phrasings of the same problem match.
  • Figure 2: Three scenarios, all with equivalent overall accuracy of 80%, illustrating different distributions of variance in model predictions leading to different $P_C$ values. Buckets represent underlying commonsense reasoning problems. Numbers within buckets represent model correctness on 5 paraphrases. Most models achieve a mix of accuracy within and across buckets of paraphrased examples.
  • Figure 3: Paraphrastic consistency ($\widetilde{P}_C$) of different models on $\delta$-SNLI paraphrased examples. All models still have room for improvement in $\widetilde{P}_C$ for their accuracy level. Here, we add supporting lines to denote varying levels of the proportion of variance attributable to paraphrasing, or $\mathrm{PVAP}$.
  • Figure 4: RoBERTa-large model on automatic versus human paraphrases. Models are more consistent on automatic than human paraphrases. Dashed lines indicate varying levels of $\mathrm{PVAP}$.
  • Figure 5: Paraphrastic consistency monotonically increases as a model sees more pretraining tokens, but grows rapidly during early pretraining.
  • ...and 3 more figures