Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma

TL;DR

This work investigates Google Gemini 1 as an automatic evaluator for subjective NLG metrics in summarization and dialog tasks, comparing its scores and justifications with expert human judgments on SummEval and USR under multiple prompting strategies and perturbations. Using Krippendorff's alpha as the reliability metric, the study finds that LLM-based evaluations can be more consistent than human raters under normal conditions, but they are not robust to adversarial perturbations and can diverge from human judgments. The results expose a significant robustness gap in using LLMs as standalone evaluators of subjective metrics and suggest directions for strengthening LLM-based evaluation, including testing alternative models and extending to multilingual and broader NLG tasks. The work underscores the importance of input integrity and robust prompting frameworks when deploying LLMs for automated quality assessment in NLG.
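For readers unfamiliar with the reliability metric named above, the following is a minimal sketch of Krippendorff's alpha, computed as one minus the ratio of observed to expected disagreement. It assumes an interval-level (squared difference) distance between scores, which this summary does not confirm as the paper's choice. The function name, the input layout (one list of ratings per evaluated item, with None for a missing rating), and the example scores are illustrative assumptions, not taken from the paper.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval (squared-difference) distance.

    `units` holds one list of ratings per evaluated item; None marks a
    missing rating. This layout is an assumption made for illustration.
    """
    # Drop missing ratings, then keep only items rated at least twice,
    # since a single rating cannot be paired with another.
    cleaned = [[v for v in u if v is not None] for u in units]
    cleaned = [u for u in cleaned if len(u) >= 2]
    n = sum(len(u) for u in cleaned)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences over ordered pairs of
    # ratings within each item, weighted by 1 / (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in cleaned
    ) / n

    # Expected disagreement: squared differences over ordered pairs of
    # all pairable ratings, pooled across items.
    pooled = [v for u in cleaned for v in u]
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - d_o / d_e


if __name__ == "__main__":
    # Hypothetical 1-5 scores from three raters on four summaries.
    scores = [[4, 4, 5], [2, 3, 2], [5, 5, None], [1, 2, 1]]
    print(krippendorff_alpha_interval(scores))
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement. Comparing alpha among human raters with alpha among repeated LLM ratings is one way to read the consistency claim above, although the paper's exact protocol is not described in this summary.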

Abstract

Traditional evaluation metrics like BLEU and ROUGE fall short in capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score and a justification for that score. Furthermore, we explore the robustness of the LLM evaluator using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations, and significant improvements are required before they can serve as reliable standalone evaluators for subjective metrics.

Paper Structure

This paper contains 32 sections, 1 equation, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Perturbation in action.