Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks

Jing Yang, Max Glockner, Anderson Rocha, Iryna Gurevych

TL;DR

Self-Rationalization in the Wild investigates how well models trained to generate free-text explanations from limited annotated data transfer to out-of-distribution (OOD) tasks. The authors fine-tune T5-Large and OLMo-7B on two explanation-annotated sources (e-SNLI and e-FEVER) and evaluate on 19 OOD datasets spanning natural language inference (NLI), fact-checking (FC), and hallucination detection in abstractive summarization (HDAS), using the reference-free Acceptability score to assess explanation quality without gold explanations. They show that few-shot fine-tuning with carefully chosen data sources can match or exceed full-dataset performance, and that the source dataset has a larger impact than the sampling strategy on OOD label prediction. Among the reference-free metrics compared, the Acceptability score correlates most strongly with human judgments, which supports its use for evaluating explanations in diverse, real-world settings. The results offer practical guidance for building robust, explainable NLP systems across domains and indicate that models with higher label accuracy tend to produce better explanations.

Abstract

Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data, making it challenging to train models for explainable predictions. To address this, we investigate how to use existing explanation datasets for self-rationalization and evaluate models' out-of-distribution (OOD) performance. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization. For the generated explanation evaluation, we conduct a human study on 13 selected models and study its correlation with the Acceptability score (T5-11B) and three other LLM-based reference-free metrics. Human evaluation shows that the Acceptability score correlates most strongly with human judgments, demonstrating its effectiveness in evaluating free-text explanations. Our findings reveal: 1) few annotated examples effectively adapt models for OOD explanation generation; 2) compared to sample selection strategies, fine-tuning data source has a larger impact on OOD performance; and 3) models with higher label prediction accuracy tend to produce better explanations, as reflected by higher Acceptability scores.
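The comparison between a reference-free metric and human judgments described above can be illustrated with a rank correlation. The sketch below is a minimal, standard-library implementation of Spearman's correlation; the choice of Spearman, the function names, and the example numbers are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean 1-based position of the tied run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical per-explanation values, invented for this example only.
    human_ratings = [1, 3, 2, 5, 4, 4]
    metric_scores = [0.10, 0.55, 0.30, 0.90, 0.60, 0.75]
    print(f"Spearman correlation: {spearman(human_ratings, metric_scores):.3f}")
```

A metric whose scores order the explanations in nearly the same way as the human ratings yields a value close to 1; repeating the computation for each candidate metric gives the kind of comparison the abstract reports.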

Paper Structure

This paper contains 61 sections, 2 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: OOD evaluation pipeline of self-rationalization. The pipeline comprises two main parts. The first part (a) covers learning to self-rationalize with a source dataset (Section \ref{fine-tuning}); it involves sample selection and fine-tuning a generative model. The second part (b) covers OOD generation and evaluation (Section \ref{ood}); we evaluate the model on three categories of OOD tasks: NLI, fact-checking, and hallucination detection in abstractive summarization.
  • Figure 2: Average macro F1 score across different numbers of shots and sample selection methods. Each point is the average over all 19 OOD datasets and the 5 models from the 5 subsets.
  • Figure 3: Distribution of models under different fine-tuning factors, with the x-axis showing the Acceptability score, and the y-axis the macro F1 score (scores are averaged over all datasets). The dashed lines are the estimated linear trends of the Acceptability score and macro F1 score.
  • Figure 4: Distribution of label prediction accuracy (balanced) across different Acceptability score ranges. The left y-axis shows the balanced accuracy of samples from that Acceptability score range, and the right y-axis shows the percentage of samples in that range.
  • Figure 5: Screenshots of the human evaluation interface.
  • ...and 3 more figures
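
The quantity plotted in Figure 4 (balanced accuracy per Acceptability score range, alongside the share of samples in each range) can be computed roughly as follows. This is a sketch under stated assumptions: the bin edges, the `(score, gold, pred)` sample layout, and the function names are hypothetical, and balanced accuracy is taken here to be the mean per-class recall over the classes present in each bin.

```python
from collections import defaultdict


def balanced_accuracy(gold, pred):
    """Mean per-class recall over the classes that occur in `gold`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy_by_score_range(samples, edges):
    """Group (score, gold, pred) triples into score bins and summarize each bin.

    Bin i covers [edges[i], edges[i + 1]); the last bin also includes its
    upper edge. Samples outside all bins are left unassigned but still count
    toward the denominator of `share`.
    """
    bins = [[] for _ in range(len(edges) - 1)]
    for score, gold, pred in samples:
        for i in range(len(bins)):
            is_last = i == len(bins) - 1
            if edges[i] <= score < edges[i + 1] or (is_last and score == edges[-1]):
                bins[i].append((gold, pred))
                break
    report = []
    for i, items in enumerate(bins):
        acc = None  # undefined for an empty bin
        if items:
            acc = balanced_accuracy([g for g, _ in items], [p for _, p in items])
        share = len(items) / len(samples) if samples else 0.0
        report.append(
            {"range": (edges[i], edges[i + 1]), "balanced_accuracy": acc, "share": share}
        )
    return report


if __name__ == "__main__":
    # Hypothetical samples: (Acceptability score, gold label, predicted label).
    samples = [
        (0.10, "entailment", "entailment"),
        (0.20, "contradiction", "entailment"),
        (0.90, "entailment", "entailment"),
        (1.00, "contradiction", "contradiction"),
    ]
    for row in accuracy_by_score_range(samples, edges=[0.0, 0.5, 1.0]):
        print(row)
```

Reporting the sample share next to the accuracy matters because a bin with very few samples can show an extreme balanced accuracy that says little about the model overall.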