A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

TL;DR

This paper proposes a novel approach to evaluating Counter Narrative (CN) generation that uses a Large Language Model (LLM) as the evaluator, and concludes that zero-shot chat-aligned models are the best option for the task.

Abstract

This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a $\rho$ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
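
The ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not specify the tournament layout, the judge prompt, or how ties are handled, so the sketch assumes a simple round-robin in which every pair of models is compared on every test item, a win scores 1 point, and a tie scores 0.5 each. The names `rank_models`, `spearman_rho`, `length_judge`, and `toy_outputs` are hypothetical, and `length_judge` is a toy stand-in for an LLM evaluator. The correlation helper uses the tie-free formula for Spearman's rank correlation, $\rho = 1 - 6\sum d_i^2 / (n(n^2-1))$.

```python
from itertools import combinations


def rank_models(outputs, judge):
    """Rank models by points collected in round-robin pairwise comparisons.

    outputs: dict mapping model name -> list of generated CNs, aligned by test item.
    judge:   callable(cn_a, cn_b) -> "a", "b", or "tie"
             (hypothetical stand-in for an LLM-based evaluator).
    """
    points = {model: 0.0 for model in outputs}
    for model_a, model_b in combinations(sorted(outputs), 2):
        for cn_a, cn_b in zip(outputs[model_a], outputs[model_b]):
            verdict = judge(cn_a, cn_b)
            if verdict == "a":
                points[model_a] += 1.0
            elif verdict == "b":
                points[model_b] += 1.0
            else:  # assumed tie handling: split the point
                points[model_a] += 0.5
                points[model_b] += 0.5
    # Highest score first; model name breaks ties deterministically.
    ranking = sorted(outputs, key=lambda m: (-points[m], m))
    return ranking, points


def spearman_rho(ranking_x, ranking_y):
    """Spearman's rho between two tie-free rankings of the same items."""
    n = len(ranking_x)
    position_y = {item: i for i, item in enumerate(ranking_y)}
    d_squared = sum((i - position_y[item]) ** 2 for i, item in enumerate(ranking_x))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def length_judge(cn_a, cn_b):
    """Toy judge that prefers the longer text; a real judge would query an LLM."""
    if len(cn_a) == len(cn_b):
        return "tie"
    return "a" if len(cn_a) > len(cn_b) else "b"


# Invented toy data: three models, two test items each.
toy_outputs = {
    "m1": ["aaa", "bb"],
    "m2": ["a", "b"],
    "m3": ["aaaa", "bbbb"],
}
ranking, points = rank_models(toy_outputs, length_judge)
human_ranking = ["m3", "m2", "m1"]  # invented human preference order
rho = spearman_rho(ranking, human_ranking)
```

In practice the judge would receive the hate speech instance alongside both candidate CNs, and the resulting model ranking would be compared against a ranking derived from human annotations, as the abstract describes.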

Paper Structure (42 sections, 5 figures, 6 tables)

This paper contains 42 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Matrix with the Spearman's rank correlation coefficients among metrics. The last row of the matrix represents the correlation of all the evaluation metrics to human preference. J-LM is short for JudgeLM.
  • Figure 2: Ranking through pairwise comparison based on evaluations of all the JudgeLM size variations across the entire test set.
  • Figure B.1: Inter-annotator agreement (IAA) of the Pairwise Rank-Based evaluation.
  • Figure D.1: Matrix with the Spearman’s rank correlation coefficients among metrics, created using 360 tournaments from CONAN. The last row of the matrix represents the correlation of all the evaluation methods to human preference.
  • Figure D.2: Matrix with the Spearman’s rank correlation coefficients among metrics, created using 360 tournaments from MT-CONAN. The last row of the matrix represents the correlation of all the evaluation methods to human preference.