EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Longfei Zuo, Barbara Plank, Siyao Peng

TL;DR

EVADE presents an LLM-based pipeline to generate and validate explanations for NLI labels, targeting annotation errors arising from human label variation. By contrasting LLM-generated explanations and validation with human VariErr data, EVADE shows that LLM validation improves alignment with human judgment distributions and can more effectively prune erroneous labels than human-only methods. The framework also demonstrates that using LLM-validated errors to prune data yields better downstream fine-tuning performance, suggesting scalable data quality enhancements for NLI datasets. Overall, EVADE reduces human effort while maintaining or improving dataset quality and model alignment with human variation, highlighting practical benefits for robust NLI systems.

Abstract

High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VARIERR (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in a first round and to flag errors via validity judgments in a second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.

Paper Structure

This paper contains 33 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Overview of our LLM-based EVADE framework compared with the human-based VariErr pipeline (Weber-Genzel et al., 2024). The first two modules, explanation generation and validation, are the core components. Compared with VariErr, our EVADE framework provides broader explanation coverage, requires less human intervention, and delivers better downstream performance in predicting label distributions.
  • Figure 2: Panel (a) shows the KL divergence (KLD) curves between model distributions and ChaosNLI annotations across three prompting scenarios, with the validation threshold varied from 0.1 to 0.9. Panels (b) and (c) present the precision and recall of the LLM-validated labels, computed against the VariErr-validated labels as ground truth, over the same range of validation thresholds.
  • Figure 3: Explanation generation prompt.
  • Figure 4: Validation prompt for one-expl scenario.
  • Figure 5: Validation prompt for one-llm and all-llm scenarios.
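
The quantities named in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the data layout (a list of `(label, validity_score)` pairs per NLI item), the helper names, the uniform fallback when nothing survives the threshold, and the direction of the KL divergence (human distribution against model distribution) are all assumptions made for illustration.

```python
import math

LABELS = ("entailment", "neutral", "contradiction")


def label_distribution(scored_explanations, threshold):
    """Normalized label counts over explanations whose score passes the threshold.

    `scored_explanations` is a hypothetical layout: (label, validity_score) pairs.
    """
    counts = {label: 0 for label in LABELS}
    for label, score in scored_explanations:
        if score >= threshold:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution if nothing is validated.
        return {label: 1.0 / len(LABELS) for label in LABELS}
    return {label: count / total for label, count in counts.items()}


def validated_labels(scored_explanations, threshold):
    """Set of labels supported by at least one explanation that passes the threshold."""
    return {label for label, score in scored_explanations if score >= threshold}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over LABELS, with a small epsilon to avoid division by zero."""
    return sum(
        p[label] * math.log((p[label] + eps) / (q[label] + eps))
        for label in LABELS
        if p[label] > 0
    )


def precision_recall(predicted, gold):
    """Precision and recall of a predicted label set against a gold label set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data for one NLI item (made up for illustration, not taken from the paper).
scored = [("entailment", 0.9), ("entailment", 0.6), ("neutral", 0.4), ("contradiction", 0.2)]
human_dist = {"entailment": 0.7, "neutral": 0.3, "contradiction": 0.0}
human_valid = {"entailment", "neutral"}

# Sweep the validation threshold from 0.1 to 0.9, as in the caption.
for step in range(1, 10):
    threshold = step / 10
    model_dist = label_distribution(scored, threshold)
    kld = kl_divergence(human_dist, model_dist)
    precision, recall = precision_recall(validated_labels(scored, threshold), human_valid)
    print(f"threshold={threshold:.1f} KLD={kld:.3f} P={precision:.2f} R={recall:.2f}")
```

In this toy setup, raising the threshold discards low-scoring explanations, which typically trades recall for precision against the reference labels; the curves reported in the paper come from the authors' own data and scoring.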