SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain

TL;DR

The paper shows that a critiquing Agent can rectify scores from LLM evaluators in the absence of references or ground-truth labels, reducing the need for labeled data even in complex NLG evaluation scenarios such as the generation of JSON-structured forms and surveys.

Abstract

Large Language Model (LLM) integrations into applications like the Microsoft365 suite and Google Workspace for creating and processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the output from LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation where references or ground-truth labels do not exist or are not amply available, this paper introduces a novel framework called "SAGEval", which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators in the absence of references or ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms and surveys with responses in different styles such as multiple choice, Likert ratings, and single-choice questions.

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: SAGEval framework. SAGEval engages with a "wiser" role-based agent to validate scores assigned by the first LLM Evaluator for reference-free texts.
  • Figure 2: Open-ended human-drafted and NLG texts such as lists, surveys, and forms contain sub-items or entities that are associated with a central theme such as "List of things to pack while traveling" or "Survey on assessing the quality of healthcare services". These items (bullets in a list, questions in a survey) differ from each other, so it is important to make sure that the variance in open-ended text is coherent and aligned to the central theme.
  • Figure 3: Score distribution by SAGE Agent compared to scores assigned by Evaluator Agent. We find that Evaluator Agent is inclined towards assigning higher ratings (4s and 5s) across all criteria, whereas SAGE Agent is more critical and pushes the score distribution towards 3s and a couple of 2s.
  • Figure 4: Term-topic frequency distributions of aspects or scoring criteria (up to 3) suggested by SAGE Agent for increasing evaluation coverage across 96 data points. We find that, along with the pre-defined aspects, SAGE Agent suggests inclusion of Creativity Score and Content Quality Score for >40% of all suggestions.
  • Figure 5: Distribution of annotation scores (between 1-5) assigned to each scoring criterion by 4 highly experienced linguists who are experienced with artificial intelligence. We note that, for the aspect Audience Engagement, there is a dramatic shift in scores, which lean heavily towards the low end (1 and 2) for all 4 human annotators.
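
The evaluate-then-critique flow described in the abstract and in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `Critique` type, the stub agents, and the "Coherence" criterion are all assumptions made for the example, and real agents would call an LLM rather than return fixed values. Only "Audience Engagement" and the 1-5 rating scale come from the figure captions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Criterion name -> rating on the 1-5 scale mentioned in the figure captions.
Scores = Dict[str, int]


@dataclass
class Critique:
    """Hypothetical container for the critiquing agent's feedback."""
    corrected: Scores  # criteria the critic wants to re-score
    rationale: str     # free-text justification


def clamp(score: int, lo: int = 1, hi: int = 5) -> int:
    """Keep a rating inside the 1-5 scale."""
    return max(lo, min(hi, score))


def evaluate_with_critic(
    text: str,
    evaluator: Callable[[str], Scores],
    critic: Callable[[str, Scores], Critique],
) -> Scores:
    """Score `text` without references, then let a critic rectify the scores."""
    initial = evaluator(text)          # first-pass LLM evaluator scores
    critique = critic(text, initial)   # critic reviews text and scores together
    final = dict(initial)
    for criterion, score in critique.corrected.items():
        if criterion in final:         # only override criteria that were scored
            final[criterion] = clamp(score)
    return final


# Stub agents standing in for LLM calls (assumptions, for illustration only).
def stub_evaluator(text: str) -> Scores:
    return {"Coherence": 5, "Audience Engagement": 4}


def stub_critic(text: str, scores: Scores) -> Critique:
    return Critique(
        corrected={"Audience Engagement": 2},
        rationale="Questions are repetitive and unlikely to engage respondents.",
    )


result = evaluate_with_critic("Survey on healthcare quality ...", stub_evaluator, stub_critic)
print(result)
```

In this sketch the critic can only lower or raise scores for criteria the evaluator already produced; suggesting additional criteria, as Figure 4 describes, would need a separate field on the hypothetical `Critique` type.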