
REV: Information-Theoretic Evaluation of Free-Text Rationales

Hanjie Chen, Faeze Brahman, Xiang Ren, Yangfeng Ji, Yejin Choi, Swabha Swayamdipta

TL;DR

This work introduces REV, a metric based on conditional V-information (CVI) that evaluates free-text rationales by measuring the amount of new, label-relevant information they provide beyond the input. REV uses two evaluators to quantify how much a rationale improves label prediction over a vacuous baseline rationale, which lets it penalize vacuous rationales and reward informative ones. Empirical results across CommonsenseQA and NLI datasets show that REV aligns more closely with human judgments than existing metrics (LAS, RQ) and is sensitive to input perturbations and prompting regimes, including GPT-3 few-shot rationales and chain-of-thought prompts. The work underscores that rationales should be valued not just for how well they support a prediction but for the unique information they contribute, and that REV offers deeper insights into model reasoning when used alongside traditional accuracy metrics.

Abstract

Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information), to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models' reasoning and prediction processes.
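One way to read the "new information beyond a vacuous baseline" idea is as a log-likelihood gap between two evaluators: one that scores the label given the rationale, and one that scores it given only a vacuous rationale. The following is a minimal sketch under that reading, not the authors' implementation. The function names, the toy probabilities, and the use of natural logarithms are assumptions made purely for illustration.

```python
import math
from typing import Sequence


def pointwise_rev(logp_with_rationale: float, logp_vacuous: float) -> float:
    """Gain in label log-likelihood from seeing the rationale.

    logp_with_rationale: log-probability the (hypothetical) evaluator given
        the rationale assigns to the label.
    logp_vacuous: log-probability the (hypothetical) baseline evaluator,
        given only a vacuous rationale, assigns to the same label.
    A positive value means the rationale added label-relevant information.
    """
    return logp_with_rationale - logp_vacuous


def average_rev(
    logps_with_rationale: Sequence[float],
    logps_vacuous: Sequence[float],
) -> float:
    """Mean of the pointwise scores over a set of examples."""
    if len(logps_with_rationale) != len(logps_vacuous):
        raise ValueError("both evaluators must score the same examples")
    if not logps_with_rationale:
        raise ValueError("need at least one example")
    scores = [
        pointwise_rev(with_r, vac)
        for with_r, vac in zip(logps_with_rationale, logps_vacuous)
    ]
    return sum(scores) / len(scores)


# Toy, made-up label probabilities for two examples (assumed values).
with_r = [math.log(0.8), math.log(0.6)]
vacuous = [math.log(0.4), math.log(0.6)]

# Example 1: the rationale doubles the label probability (positive score).
# Example 2: the rationale changes nothing (score of zero).
print(average_rev(with_r, vacuous))
```

In this sketch, a rationale that merely restates what the baseline evaluator already knows scores near zero, while one that makes the label more predictable scores above zero.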

Paper Structure (35 sections, 7 equations, 10 figures, 8 tables, 1 algorithm)


Figures (10)

  • Figure 1: Our evaluation framework for different free-text rationales ($r$). ${r}_1^*$ is a human-written rationale; ${\hat{r}}_{1,a}$ and ${\hat{r}}_{1,b}$ are two generated rationales for the true label $y_1$. Our metric, REV, based on CVI (Hewitt et al., 2021), distinguishes all three rationales by measuring how much new, label-relevant information each adds over a vacuous rationale, ${b}$; performance-based evaluations can only distinguish between ${\hat{r}}_{1,a}$ and ${\hat{r}}_{1,b}$. For an (arguably) incorrect label, $y_2$, REV still gives a positive score, highlighting that ${\hat{r}}_2$ provides new information for why it supports $y_2$. Prediction accuracy can be augmented with REV to give a fuller interpretation of model decisions.
  • Figure 2: Left: Automatic evaluation results of LAS, RQ and REV for rationale-label pairs on the ECQA test set. Right: Human evaluation for rationale-label pairs on 230 randomly selected examples from the ECQA test set.
  • Figure 3: Sensitivity test results of REV, LAS and RQ for X→RY and X→YR on the ECQA dataset. The x-axis shows different levels of noise ($\sigma^{2}$). Model prediction accuracy vs. noise is plotted as a gray dashed line. In addition to the overall evaluation on all test examples ("Overall"), results are shown separately for the examples on which the model predictions are correct ("Correct") or incorrect ("Incorrect").
  • Figure 4: Histograms of the human-annotated amount of information and pointwise REV, LAS and RQ scores on GPT-3 few-shot prompted rationales for gold labels.
  • Figure 5: Distributions of REV for rationales w.r.t. correct and incorrect predictions produced by GPT-3 and LaMDA, respectively. The average REV scores over all instances, correctly predicted instances and incorrectly predicted instances are marked by gray, blue and red dashed lines, respectively.
  • ...and 5 more figures