TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains

Wanying Wang, Zeyu Ma, Xuhong Wang, Yangchun Zhang, Pengfei Liu, Mingang Chen

TL;DR

The paper addresses the need for scalable, domain-specific evaluation of LLMs by introducing TestAgent, an automated framework that generates vertical-domain benchmarks via retrieval-augmented generation and refines evaluation criteria through a two-stage process. It then employs an RL-guided, multi-turn interaction strategy to dynamically probe knowledge boundaries and assess professionalism and stability. Across medical, legal, and government domains, TestAgent demonstrates cross-domain benchmark generation and provides richer insights than static benchmarks, including the ability to activate existing benchmarks in a dynamic setting. The approach yields quantitative metrics on dynamism, professionalism, and stability, highlighting the value of adaptive questioning in evaluating domain expertise and reliability.

Abstract

As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, the evaluation of their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets, which presents two key limitations: (i) manual data construction is costly and must be repeated for each new domain, and (ii) static single-turn evaluations are misaligned with the dynamic multi-turn interactions in real-world applications, limiting the assessment of professionalism and stability. To address these limitations, we propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains. TestAgent leverages retrieval-augmented generation to create domain-specific questions from user-provided knowledge sources, combined with a two-stage criteria generation process, thereby enabling scalable and automated benchmark creation. Furthermore, it introduces a reinforcement learning-guided multi-turn interaction strategy that adaptively determines question types based on real-time model responses, dynamically probing knowledge boundaries and stability. Extensive experiments across medical, legal, and governmental domains demonstrate that TestAgent enables efficient cross-domain benchmark generation and yields deeper insights into model behavior through dynamic exploratory evaluation. This work establishes a new paradigm for automated and in-depth evaluation of LLMs in vertical domains.

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Statistical results of evaluation demand and bottlenecks in the LLM industry.
  • Figure 2: Illustration of TestAgent framework. Leveraging a RAG system built upon the domain knowledge base and topics, it first generates initial questions along with a two-stage criteria construction process, progressing from general topic-level criteria to refined question-specific criteria. Reinforcement Learning then guides the subsequent questioning strategy, determining whether to issue challenges or generate follow-up questions. A multi-dimensional analysis of dynamism, professionalism, and stability is ultimately derived from these interactions.
  • Figure 3: Strategic evaluation performance across domains. Comparison of TestAgent with the Untrained baseline and the Q-Embedding baseline, in which the RL state space incorporates question embeddings. The dashed horizontal line at 0.5 represents random selection.
  • Figure 4: Static single-turn evaluation versus TestAgent dynamic evaluation in the legal domain.
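
The Figure 2 caption describes a reinforcement-learning step that decides, after each model response, whether to issue a challenge or a follow-up question. The sketch below is one way such a decision loop could look. It is not the authors' implementation: the tabular Q-learning update, the three-bucket state derived from a response score, and the names `QuestionPolicy`, `bucket`, `alpha`, `gamma`, and `epsilon` are all illustrative assumptions, since the summary does not specify the RL algorithm, state space, or reward.

```python
import random
from collections import defaultdict

# The two question types named in the Figure 2 caption.
ACTIONS = ("challenge", "follow_up")


def bucket(score: float) -> int:
    """Map a response-quality score in [0, 1] to one of 3 coarse states.

    Assumption: the real state space is richer; this is only for illustration.
    """
    return min(int(score * 3), 2)


class QuestionPolicy:
    """Hypothetical epsilon-greedy tabular Q-learning policy over question types."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state: int) -> str:
        # Explore with probability epsilon, otherwise pick the best-valued action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state: int, action: str, reward: float, next_state: int) -> None:
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

In use, an evaluator would score each response, call `bucket` to get the state, call `choose` to pick the next question type, and call `update` once the next response has been scored. How the reward is defined (for example, whether a challenge exposed an unstable answer) is left open here because the summary does not state it.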