Rationale-Aware Answer Verification by Pairwise Self-Evaluation

Akira Kawabata, Saku Sugawara

TL;DR

Verifiers trained on solutions selected by REPS outperform those trained with conventional methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). The results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.

Abstract

Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach ignores solutions in which a flawed rationale happens to yield the correct answer, undermining the verifier's ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, thus leading to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish valid from flawed rationales. To build a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.
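
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the candidates have already been filtered to those with the correct final answer, and the names `Judge`, `pairwise_winner`, `select_by_tournament`, and `toy_judge` are hypothetical. In REPS the judge would be the same LLM that generated the solutions; here it is any callable, and the toy judge below (which simply prefers the longer text) exists only to make the sketch runnable. Alternating the presentation order across repeated comparisons and breaking ties in favor of the first candidate are assumptions of this sketch, not details confirmed by the abstract.

```python
from collections.abc import Callable, Sequence

# Hypothetical judge signature: (question, first, second) -> True if the
# rationale in `first` is preferred over the one in `second`.
Judge = Callable[[str, str, str], bool]


def pairwise_winner(question: str, a: str, b: str, judge: Judge,
                    num_comparisons: int = 1) -> str:
    """Compare a and b num_comparisons times and return the majority winner."""
    votes_for_a = 0
    for i in range(num_comparisons):
        if i % 2 == 0:
            votes_for_a += int(judge(question, a, b))
        else:
            # Swap the presentation order (an assumption of this sketch).
            votes_for_a += int(not judge(question, b, a))
    # Ties (possible when num_comparisons is even) go to a.
    return a if 2 * votes_for_a >= num_comparisons else b


def select_by_tournament(question: str, candidates: Sequence[str],
                         judge: Judge, num_comparisons: int = 1) -> str:
    """Single-elimination tournament over candidate solutions."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [
            pairwise_winner(question, pool[i], pool[i + 1], judge,
                            num_comparisons)
            for i in range(0, len(pool) - 1, 2)
        ]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out advances unopposed
        pool = next_round
    return pool[0]


def toy_judge(question: str, first: str, second: str) -> bool:
    """Stand-in for an LLM judge: prefers the longer text (illustration only)."""
    return len(first) >= len(second)
```

With N candidates, a tournament of this shape needs N - 1 matches, each consisting of `num_comparisons` judge calls. The selected solution would then serve as a positive training example for the verifier.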

Paper Structure (51 sections, 5 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Importance of considering rationale quality in answer verification. Verifiers trained on correct answers with flawed reasoning (blue) fail to identify valid rationales at inference. In contrast, verifiers trained on solutions with correct answers and rationales (yellow) can distinguish valid reasoning.
  • Figure 2: Rationale Accuracy (%) and Answer Accuracy (%) of verifier models trained on datasets with varying levels of rationale quality.
  • Figure 3: Rationale Accuracy (%) and Answer Accuracy (%) as a function of the ratio of high-quality rationales mixed into the baseline dataset.
  • Figure 4: Rationale Enhancement through Pairwise Selection (REPS). The generator model produces candidate solutions and filters out those with incorrect answers. Unlike the conventional pipeline (top), REPS (bottom) employs a tournament-style pairwise evaluation to iteratively select the better solution. This refined solution is then used to train a rationale-aware verifier.
  • Figure 5: The effect of varying the number of candidate solutions ($N$) and the number of pairwise comparisons per match ($S$) on the Rationale Accuracy (%) and average length of selected rationales. Increasing $N$ and $S$ leads to a decrease in Rationale Accuracy and an increase in the average length of selected rationales.
  • ...and 1 more figure