AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content
Thanh Vu, Richi Nayak, Thiru Balasubramaniam
TL;DR
AgentEval uses Generative Agents guided by chain-of-thought to simulate human evaluation of AI-generated content, addressing the cost and reliability problems of human judgments. The framework personalizes each agent, draws on perception, memory, planning, and reflection components, and rates content on multiple criteria (coherence, relevance, interestingness, fairness, clarity), with voting used to define the evaluation standards. In the reported experiments, AgentEval correlates more closely with human judgments than state-of-the-art baselines and achieves lower error metrics, suggesting a scalable alternative for automated content evaluation in business contexts. The practical implication is lower annotation cost while maintaining evaluation quality; future directions include richer personality profiling and criterion optimization.
Abstract
Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.
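As a rough illustration of the multi-criteria rating idea described above, the following minimal sketch averages per-criterion scores from several agents and measures how far the result sits from a human rating. It is not the authors' implementation: the 1-to-5 scale, the mean aggregation, the function names `aggregate_ratings` and `mean_absolute_error`, and all scores are assumptions made up for illustration.

```python
from statistics import mean

# The five criteria named in the abstract.
CRITERIA = ("coherence", "relevance", "interestingness", "fairness", "clarity")


def aggregate_ratings(agent_ratings):
    """Average each criterion across agents (assumed aggregation rule)."""
    return {c: mean(r[c] for r in agent_ratings) for c in CRITERIA}


def mean_absolute_error(predicted, reference):
    """Mean absolute difference between two per-criterion ratings."""
    return mean(abs(predicted[c] - reference[c]) for c in CRITERIA)


# Hypothetical ratings from three agents on an assumed 1-5 scale.
agent_ratings = [
    {"coherence": 4, "relevance": 5, "interestingness": 3, "fairness": 4, "clarity": 4},
    {"coherence": 5, "relevance": 4, "interestingness": 3, "fairness": 4, "clarity": 5},
    {"coherence": 3, "relevance": 4, "interestingness": 3, "fairness": 4, "clarity": 3},
]

# Hypothetical human rating of the same piece of content.
human_rating = {"coherence": 4, "relevance": 4, "interestingness": 4, "fairness": 4, "clarity": 4}

agent_scores = aggregate_ratings(agent_ratings)
error = mean_absolute_error(agent_scores, human_rating)

print(agent_scores)
print(f"MAE vs. human rating: {error:.3f}")
```

A lower error on a set of such items would indicate closer agreement between the agents and human raters; a correlation coefficient over many items could be computed in the same spirit.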
