AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content

Thanh Vu, Richi Nayak, Thiru Balasubramaniam

TL;DR

AgentEval introduces Generative Agents guided by chain-of-thought to simulate human evaluation of AI-generated content, addressing the cost and reliability issues of human judgments. The framework personalizes agents, leverages perception-memory-planning-reflection, and uses multi-criteria ratings (coherence, relevance, interestingness, fairness, clarity) with voting to define evaluation standards. Experimental results show AgentEval correlates with human judgments more closely than state-of-the-art baselines and achieves lower error metrics, suggesting a scalable alternative for automated content evaluation in business contexts. The work highlights practical implications for reducing annotation costs while maintaining evaluation quality, with future directions in richer personality profiling and criterion optimization.
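As a rough illustration of what "correlates with human judgments" and "lower error metrics" mean in practice, the sketch below compares a list of agent ratings with a list of human ratings. The summary does not name the specific metrics, so Pearson correlation and mean absolute error are assumptions here, as are the function names; the sample ratings are illustrative only and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists (assumed metric)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute gap between paired ratings (assumed metric)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty lists")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    # Hypothetical 1-to-5 ratings for five articles on a single criterion.
    human_ratings = [5, 4, 3, 2, 4]
    agent_ratings = [5, 4, 3, 3, 4]
    print("correlation:", pearson(human_ratings, agent_ratings))
    print("mean absolute error:", mean_absolute_error(human_ratings, agent_ratings))
```

A higher correlation and a lower error would both indicate that the agent's ratings track the human ratings more closely.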

Abstract

Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The framework of AgentEval includes two main components: Chain-of-Thoughts and Generative Agent. In the initial interaction with the Agent, we provide the Task introduction to prepare it for upcoming actions. We then define evaluation criteria across five dimensions: Coherence, Relevance, Interestingness, Fairness, and Clarity. Next, we ask each agent to review our generated articles and rate them on a scale of 1 to 5.
  • Figure 2: A sample prompt from a user to Sarah Persona (agent) about quantifying a 5-star article in terms of coherence. This prompt is repeated for lower ratings and other evaluation metrics to provide insight into each agent's thoughts on good content. We then unify all agents' responses using majority voting to develop our Evaluation Criteria.
  • Figure 3: Feature Importance Across Rating Dimensions. The diagram illustrates the importance of various features for individual rating dimensions. In most evaluation dimensions, the agent's assessments are well-aligned with human judgments, particularly in recognizing that 'Job' is the most significant feature. However, in the 'Interestingness' dimension, the agent diverges from human judgment, placing greater importance on 'Experience' (EXP) than on 'Job'.
  • Figure 4: Average rating of articles generated by two Large Language Models: GPT4 and Ollama3.1. There are no significant differences between them, although Ollama3.1 appears more interesting while GPT4 appears fairer in its content writing.
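
The majority-voting step mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the captions do not say how free-text agent responses are made comparable or how ties are broken, so the sketch assumes each response has already been reduced to a short normalized statement and breaks ties in favor of the statement seen first. The function names and sample responses are hypothetical.

```python
from collections import Counter


def majority_vote(responses):
    """Return the most frequent response; ties go to the earliest one seen."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    # Counter keeps insertion order and max() returns the first maximal key,
    # so a tie resolves to the response that appeared first (an assumption).
    return max(counts, key=counts.get)


def build_criteria(agent_responses):
    """Unify per-agent statements into one statement per (criterion, stars) key.

    agent_responses maps (criterion, stars) to the list of statements
    collected from all agents for that criterion and rating level.
    """
    return {key: majority_vote(statements)
            for key, statements in agent_responses.items()}


if __name__ == "__main__":
    # Hypothetical normalized statements from three agents.
    collected = {
        ("coherence", 5): ["ideas flow logically", "ideas flow logically",
                           "no contradictions"],
        ("coherence", 1): ["disjointed paragraphs", "no clear structure",
                           "disjointed paragraphs"],
    }
    for key, statement in build_criteria(collected).items():
        print(key, "->", statement)
```

In this sketch the same routine is applied once per criterion and rating level, mirroring the caption's note that the prompt is repeated for lower ratings and for the other evaluation metrics.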