RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

TL;DR

RoParQ introduces a targeted benchmark and metric to quantify cross-paraphrase robustness in closed-book multiple-choice QA (MCQA). It combines paraphrase generation via proprietary models with judge-based filtering that retains only examples eliciting inconsistent confidence from a judge model, and it defines XParaCon to measure consistency across paraphrase variants. The authors propose a reasoning-based, paraphrase-aware SFT (via LoRA) to align models toward invariant answers, demonstrating that lightweight models can reach the robustness levels of larger models. This work addresses superficial memorization in LLMs and offers a practical path toward more reliable, semantically grounded QA systems.

Abstract

Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model's robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.
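The abstract describes XParaCon only as measuring the standard deviation of accuracies across question variants. As a rough illustration of that underlying quantity, the sketch below computes one accuracy per variant (the original wording plus each paraphrase) and then takes the standard deviation of those accuracies. The function names, the input layout, and the choice of population standard deviation are assumptions made for this example; any scaling or transformation the paper applies to obtain the final XParaCon score is not reproduced here.

```python
from statistics import pstdev


def per_variant_accuracy(correct):
    """Accuracy of each question variant over the whole dataset.

    `correct` is a list with one row per question; each row holds one
    boolean per variant (hypothetical layout, assumed for this sketch).
    """
    n_questions = len(correct)
    n_variants = len(correct[0])
    return [
        sum(row[v] for row in correct) / n_questions
        for v in range(n_variants)
    ]


def accuracy_spread(correct):
    """Population standard deviation of the per-variant accuracies.

    Assumption: population (not sample) standard deviation. A lower
    value means the model behaves more consistently across paraphrases.
    """
    return pstdev(per_variant_accuracy(correct))


# Toy data: 4 questions, 3 variants each (original + 2 paraphrases).
results = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, False],
]

print(per_variant_accuracy(results))  # [1.0, 0.75, 0.25]
print(round(accuracy_spread(results), 4))
```

A model that answers every variant of every question identically would give a spread of 0, which is the behavior the paraphrase-aware fine-tuning aims for.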

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: An example of the LLM generating an incorrect response when given a paraphrased question.
  • Figure 2: Movements of Accuracies and XParaCon scores within each model family, representing the effect of scale and fine-tuning.
  • Figure 3: XParaCon score of each model in the general knowledge subset.
  • Figure 5: Prompt used for multiple choice question answering in the general knowledge subset.
  • Figure 6: Prompt used for multiple choice question answering in the math reasoning subset. The model is instructed to generate its reasoning first, since questions in this subset necessarily require step-by-step reasoning.
  • ...and 3 more figures