Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

Sai Shridhar Balamurali, Lu Cheng

TL;DR

The paper tackles the difficulty of evaluating long-form QA outputs from large language models by critiquing traditional lexical metrics and expensive LLM-based judges. It revisits Natural Language Inference (NLI) as a lightweight evaluation paradigm, enhanced with a simple lexical equivalence signal, and validates it on DIVER-QA, a new human-annotated benchmark spanning five QA datasets and five LLMs. The proposed NLI-based metric and its NLI+lex augmentation demonstrate competitive alignment with human judgments, approaching or matching GPT-4o performance while using far fewer parameters. The work provides open resources (DIVER-QA) and establishes a cost-effective framework for future metric research, with noted limitations and directions for extending to dialogue and multimodal tasks.

Abstract

Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas "LLM-as-Judge" scoring is computationally expensive. We re-evaluate a lightweight alternative -- off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag -- and find that this decades-old technique matches GPT-4o's accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
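To make the idea of "NLI scoring augmented by a lexical-match flag" concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how the two signals are combined, so the OR-style combination, the 0.5 entailment threshold, the way premise and hypothesis are built, and all names (`normalize`, `lexical_match`, `nli_lex_correct`, `nli_scorer`) are assumptions made for illustration. The NLI model is passed in as a plain callable so that any off-the-shelf classifier could be plugged in.

```python
import re
import string
from typing import Callable, Dict

# Hypothetical interface: takes (premise, hypothesis) and returns label
# probabilities, e.g. {"entailment": 0.9, "neutral": 0.08, "contradiction": 0.02}.
NLIScorer = Callable[[str, str], Dict[str, float]]


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(reference: str, candidate: str) -> bool:
    """Flag: does the normalized reference appear as whole words in the candidate?"""
    ref = normalize(reference)
    cand = normalize(candidate)
    return bool(ref) and f" {ref} " in f" {cand} "


def nli_lex_correct(
    question: str,
    reference: str,
    candidate: str,
    nli_scorer: NLIScorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> bool:
    """Judge a candidate answer as correct if either signal fires (assumed rule)."""
    if lexical_match(reference, candidate):
        return True
    # Assumed framing: the candidate answer should entail the reference answer,
    # with the question prepended to both sides for context.
    premise = f"{question} {candidate}"
    hypothesis = f"{question} {reference}"
    probs = nli_scorer(premise, hypothesis)
    return probs.get("entailment", 0.0) >= threshold
```

In this sketch the lexical flag acts as a cheap shortcut for answers that contain the reference verbatim, and the NLI scorer is consulted only for the remaining cases, such as paraphrases. Other combinations (for example, feeding the flag to a learned classifier alongside the NLI probabilities) are equally plausible readings of the abstract.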

Paper Structure

This paper contains 19 sections, 2 equations, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: Framework for the NLI+lex model.
  • Figure 2: Compute vs Performance ratio of the metrics used.
  • Figure 3: Human Annotator UI
  • Figure 4: Modelwise MCC scores
  • Figure 5: Modelwise F1 scores
  • ...and 4 more figures