TestAgent: Automatic Benchmarking and Exploratory Interaction for Evaluating LLMs in Vertical Domains
Wanying Wang, Zeyu Ma, Xuhong Wang, Yangchun Zhang, Pengfei Liu, Mingang Chen
TL;DR
The paper addresses the need for scalable, domain-specific evaluation of LLMs by introducing TestAgent, an automated framework that generates vertical-domain benchmarks via retrieval-augmented generation and produces evaluation criteria through a two-stage process. It then uses an RL-guided, multi-turn interaction strategy to probe knowledge boundaries dynamically and to assess professionalism and stability. Across medical, legal, and governmental domains, TestAgent demonstrates cross-domain benchmark generation and provides richer insights than static benchmarks, including the ability to use existing static benchmarks in a dynamic, multi-turn setting. The approach yields quantitative metrics on dynamism, professionalism, and stability, highlighting the value of adaptive questioning in evaluating domain expertise and reliability.
Abstract
As Large Language Models (LLMs) are increasingly deployed in highly specialized vertical domains, evaluating their domain-specific performance becomes critical. However, existing evaluations for vertical domains typically rely on the labor-intensive construction of static single-turn datasets, which has two key limitations: (i) manual data construction is costly and must be repeated for each new domain, and (ii) static single-turn evaluations are misaligned with the dynamic multi-turn interactions of real-world applications, limiting the assessment of professionalism and stability. To address these limitations, we propose TestAgent, a framework for automatic benchmarking and exploratory dynamic evaluation in vertical domains. TestAgent leverages retrieval-augmented generation to create domain-specific questions from user-provided knowledge sources and combines this with a two-stage criteria generation process, enabling scalable and automated benchmark creation. Furthermore, it introduces a reinforcement learning-guided multi-turn interaction strategy that adaptively determines question types based on real-time model responses, dynamically probing knowledge boundaries and stability. Extensive experiments across medical, legal, and governmental domains demonstrate that TestAgent enables efficient cross-domain benchmark generation and yields deeper insights into model behavior through dynamic exploratory evaluation. This work establishes a new paradigm for automated and in-depth evaluation of LLMs in vertical domains.
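The idea of adaptively choosing the next question type from the model's responses can be illustrated with a toy sketch. This is not TestAgent's implementation: the abstract does not specify the RL algorithm, so a simple epsilon-greedy bandit is swapped in here purely for illustration. The question-type labels, the `QuestionTypeBandit` class, and the `toy_reward` function (a stand-in for scoring a real model response) are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical question types; the abstract does not enumerate the real ones.
QUESTION_TYPES = ["follow_up", "rephrase", "new_topic"]


class QuestionTypeBandit:
    """Epsilon-greedy selector over question types (illustrative only)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # times each type was asked
        self.values = defaultdict(float)  # running mean reward per type

    def select(self):
        # Explore with probability epsilon, otherwise pick the best-so-far type.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental running-mean update of the estimated reward.
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n


def toy_reward(question_type):
    # Assumption: a fixed score standing in for "how informative was the
    # evaluated model's answer to this kind of question".
    return {"follow_up": 0.2, "rephrase": 0.5, "new_topic": 0.1}[question_type]


def run_episode(turns=20):
    """Simulate one multi-turn session and return the bandit and its choices."""
    bandit = QuestionTypeBandit(QUESTION_TYPES, epsilon=0.2, seed=0)
    history = []
    for _ in range(turns):
        question_type = bandit.select()
        bandit.update(question_type, toy_reward(question_type))
        history.append(question_type)
    return bandit, history


if __name__ == "__main__":
    bandit, history = run_episode()
    print(history)
    print(dict(bandit.values))
```

In a real setting, the reward would come from judging the evaluated model's answer against the generated criteria rather than from a fixed lookup table, and the policy would likely condition on the dialogue state rather than treating each turn independently.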
