ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

TL;DR

This work presents ACORN, a new dataset of 3,500 free-text explanations with aspect-wise quality ratings. Labels produced by larger models maintained or increased inter-annotator agreement, suggesting that they fall within the expected variance between human raters.

Abstract

Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that labels produced by larger models maintained or increased inter-annotator agreement, suggesting that they fall within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across quality aspects, indicating that they are not a complete replacement. Using LLMs as a supplement to a smaller group of human raters improved the correlation with the original majority labels in some cases. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN

Paper Structure

This paper contains 33 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: We collected aspect-wise human ratings for 3,500 textual explanations for commonsense reasoning benchmarks. We compared these against ratings from large language models (LLMs) to evaluate their alignment with human judgments.
  • Figure 2: We collected general and aspect-wise ratings for human-written, LLM-improved, better human-written, and LLM-generated explanations, for BCOPA and CommonsenseQA, respectively.
  • Figure 3: Inter-annotator agreement (Krippendorff's $\alpha$, $\times 100$ for legibility) between human raters (shaded area) and with the LLM's rating replacing a random rater.
  • Figure 4: Spearman's rank correlation between majority-voted human labels and LLM-generated ratings ($\times 100$ for legibility).
  • Figure 5: A comparison of Spearman's rank correlation with the original gold labels when using fewer raters (*H) and when an LLM is added as an additional rater (*H+LLM). From left to right, the number of human raters decreases from four to one (randomly selected). All values are multiplied by 100 for legibility.
  • ...and 2 more figures
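
The comparison described in the Figure 4 caption, Spearman's rank correlation between majority-voted human labels and LLM ratings, can be sketched roughly as follows. This is a minimal illustration and not the authors' evaluation code: the ratings are made up, the 1-5 scale and five raters per item are assumptions, and the tie-breaking rule in `majority_vote` (lowest value wins) is a hypothetical choice.

```python
from collections import Counter
from math import sqrt


def majority_vote(ratings):
    """Return the most common rating; ties go to the lowest value (an assumption)."""
    counts = Counter(ratings)
    top = max(counts.values())
    return min(r for r, c in counts.items() if c == top)


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up ratings: five human raters per explanation (rows), one LLM rating each.
human_ratings = [
    [1, 1, 2, 1, 1],
    [3, 3, 3, 2, 3],
    [2, 2, 1, 2, 3],
    [5, 4, 5, 5, 5],
    [4, 4, 4, 3, 5],
]
llm_ratings = [1, 3, 2, 4, 5]

gold = [majority_vote(row) for row in human_ratings]
rho = spearman(gold, llm_ratings)
print(f"Spearman's rho x 100: {rho * 100:.0f}")
```

The same `spearman` function could in principle be reused for the Figure 5 setting by recomputing the majority vote over a random subset of human raters, with or without the LLM rating appended, and correlating the result with the original majority labels.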