Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian-guang Lou, Haoyi Xiong

TL;DR

The paper addresses the challenge of evaluating LLM-based agents in multi-turn conversations by conducting a PRISMA-inspired review and proposing two interrelated taxonomies that separate what to evaluate from how to evaluate. It synthesizes nearly 250 sources published across 2017–2025, covering end-to-end experience, action and tool use, memory, and planning, and it surveys data, metrics, and benchmark resources spanning annotation-based, annotation-free, and self-judging approaches. Key contributions include a holistic assessment framework, a detailed categorization of evaluation goals and methodologies, and a catalog of benchmarks, such as GAIA and MTU-Bench, that enable cross-study comparability. The work advances practical evaluation of real-world dialogue systems by outlining open challenges (memory retention, scalability, privacy) and proposing directions for automated, adaptive, and privacy-conscious evaluation pipelines.

Abstract

This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Using a PRISMA-inspired framework, we systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication, and establishing a solid foundation for our analysis. Our study offers a structured approach by developing two interrelated taxonomy systems: one that defines "what to evaluate" and another that explains "how to evaluate". The first taxonomy identifies key components of LLM-based agents for multi-turn conversations and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, as well as planning and tool integration. These components ensure that the performance of conversational agents is assessed in a holistic and meaningful manner. The second taxonomy system focuses on the evaluation methodologies. It categorizes approaches into annotation-based evaluations, automated metrics, hybrid strategies that combine human assessments with quantitative measures, and self-judging methods utilizing LLMs. This framework not only captures traditional metrics derived from language understanding, such as BLEU and ROUGE scores, but also incorporates advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogues.
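
As a rough illustration of the traditional overlap metrics the abstract mentions, the sketch below scores each agent turn against a reference response and averages over the dialogue. It is a simplification, not the survey's method: it computes only a BLEU-1-style clipped unigram precision (no higher-order n-grams, no brevity penalty) and a ROUGE-1-style unigram recall, with whitespace tokenization. The function names `unigram_overlap` and `score_dialogue` are hypothetical.

```python
from collections import Counter


def unigram_overlap(reference: str, candidate: str) -> tuple[float, float]:
    """Return (clipped unigram precision, unigram recall) for one turn."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped matches: a candidate token counts at most as often as it
    # appears in the reference (the clipping idea used by BLEU).
    matches = sum((ref_counts & cand_counts).values())
    precision = matches / max(sum(cand_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    return precision, recall


def score_dialogue(turns: list[tuple[str, str]]) -> tuple[float, float]:
    """Average per-turn precision and recall over (reference, response) pairs."""
    if not turns:
        return 0.0, 0.0
    scores = [unigram_overlap(ref, cand) for ref, cand in turns]
    n = len(scores)
    avg_precision = sum(p for p, _ in scores) / n
    avg_recall = sum(r for _, r in scores) / n
    return avg_precision, avg_recall
```

Averaging per turn treats every turn as equally important and ignores dependencies between turns, which is one reason such metrics alone are considered insufficient for multi-turn dialogue.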

Paper Structure

This paper contains 34 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: An example of Multi-turn Conversational Agents based on LLMs
  • Figure 2: Selection Process for Papers Evaluating LLM-based Agents in Multi-turn Conversations
  • Figure 3: Taxonomy of Evaluation Approaches for LLM-Based Multi-Turn Conversational Agents: A Comprehensive Survey of Goals, Methodologies, and Future Directions.
  • Figure 4: Taxonomy of Evaluation Memory of LLM-Based Agents in Multi-Turn Conversations.
  • Figure 5: Four Types of User-Agent Interactions [zhang2024surveymemorymechanismlarge]. Complete Interactions: the database stores all interactions between the user and the agent, so every conversation or action is captured and retained for future reference. Recent Interactions: stored information is discarded once the time gap exceeds a limit; for example, if the user asks for weather details and more than 24 hours pass, that information is removed from the database, so when the user later asks whether a jacket is needed for a walk, the agent no longer knows the weather. Retrieved Interactions: the agent recalls past interactions; for instance, if the user asks for weekend recommendations and the agent remembers that the user enjoyed rock climbing last summer, it may suggest a hiking trip that includes rock climbing. External Interactions: the user uploads an image, and the agent uses image recognition to analyze the content; the agent might then use a math tool to solve an equation written in the image.
  • ...and 6 more figures
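
The four interaction types described in the Figure 5 caption can be sketched as four views over one interaction log. This is a minimal illustration under stated assumptions, not the paper's implementation: the class and method names are hypothetical, the 24-hour window is taken from the caption's weather example, and retrieval is a naive keyword match standing in for whatever recall mechanism a real agent uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Interaction:
    timestamp: datetime
    text: str
    source: str = "dialogue"  # hypothetical label; "external" marks uploads or tool input


@dataclass
class InteractionLog:
    retention: timedelta = timedelta(hours=24)  # window from the caption's example
    items: list[Interaction] = field(default_factory=list)

    def add(self, item: Interaction) -> None:
        self.items.append(item)

    def complete(self) -> list[Interaction]:
        """Complete interactions: everything ever stored."""
        return list(self.items)

    def recent(self, now: datetime) -> list[Interaction]:
        """Recent interactions: drop anything older than the retention window."""
        return [i for i in self.items if now - i.timestamp <= self.retention]

    def retrieved(self, query: str) -> list[Interaction]:
        """Retrieved interactions: naive keyword overlap with the query."""
        wanted = set(query.lower().split())
        return [i for i in self.items if wanted & set(i.text.lower().split())]

    def external(self) -> list[Interaction]:
        """External interactions: content that came from outside the dialogue."""
        return [i for i in self.items if i.source == "external"]
```

Under this sketch, a weather answer stored 30 hours ago would still appear in `complete()` but not in `recent(now)`, which mirrors the jacket example in the caption.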