GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

TL;DR

The paper addresses the limitations of static benchmarks for evaluating LLMs by introducing GuessArena, an adaptive, game-based framework for evaluating domain-specific knowledge and reasoning. It combines domain knowledge modeling with a multi-turn interactive reasoning procedure: card decks are constructed automatically using retrieval-augmented generation and spectral clustering, and a composite score integrates reasoning accuracy, efficiency, and knowledge applicability. Experiments across five industries and nine models show that the chain-of-thought (CoT) and knowledge-driven prompting strategies can reveal distinct strengths in reasoning and domain knowledge, with GPT-4o often serving as a reliable judge. The framework offers improved interpretability, scalability, and adaptability for domain-specific evaluation, and it is released as open source to support broader adoption and ongoing benchmarking.

Abstract

The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the "Guess Who I Am?" game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
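
To make the game-based interaction concrete, the following is a minimal sketch of one guessing round, not the authors' implementation. It assumes a toy deck in which every card is described by hypothetical yes/no attributes, replaces the evaluated LLM with a simple greedy questioner, and replaces the judge with a dictionary lookup. The names DECK, oracle, pick_question, and play are illustrative and do not come from the paper.

```python
from typing import Dict, List, Tuple

Deck = Dict[str, Dict[str, bool]]

# Hypothetical toy deck: the cards and attributes are illustrative only.
DECK: Deck = {
    "Pharmacologist": {"sees_patients": False, "studies_drugs": True, "operates": False},
    "Surgeon": {"sees_patients": True, "studies_drugs": False, "operates": True},
    "Nurse": {"sees_patients": True, "studies_drugs": False, "operates": False},
    "Pharmacist": {"sees_patients": True, "studies_drugs": True, "operates": False},
}


def oracle(deck: Deck, target: str, attribute: str) -> bool:
    """Stand-in for the judge: answer a yes/no question about the target card."""
    return deck[target][attribute]


def pick_question(deck: Deck, candidates: List[str], remaining: List[str]) -> str:
    """Greedy stand-in for the evaluated model: choose the attribute whose
    yes/no split of the remaining candidates is closest to even."""
    def imbalance(attribute: str) -> int:
        yes = sum(deck[card][attribute] for card in candidates)
        return abs(2 * yes - len(candidates))
    return min(remaining, key=imbalance)


def play(deck: Deck, target: str) -> Tuple[str, List[Tuple[str, bool]]]:
    """Ask questions until one candidate is left; return the guess and transcript."""
    candidates = list(deck)
    remaining = list(next(iter(deck.values())))
    transcript: List[Tuple[str, bool]] = []
    while len(candidates) > 1 and remaining:
        attribute = pick_question(deck, candidates, remaining)
        remaining.remove(attribute)
        answer = oracle(deck, target, attribute)
        transcript.append((attribute, answer))
        # Keep only the cards consistent with the oracle's answer.
        candidates = [c for c in candidates if deck[c][attribute] == answer]
    return candidates[0], transcript


if __name__ == "__main__":
    guess, transcript = play(DECK, "Pharmacologist")
    print(guess, len(transcript), transcript)
```

In this sketch the number of entries in the transcript plays the role of an efficiency signal and the correctness of the final guess plays the role of an accuracy signal. How the framework actually combines such signals into its composite score is defined in the paper and is not reproduced here.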

Paper Structure

This paper contains 27 sections, 7 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Illustration of the "Guess Who I Am?" game. In this game, two players engage in an interactive process of questioning and reasoning to identify the opponent's chosen card. The player who correctly guesses the target card in the fewest attempts is the winner.
  • Figure 2: Framework of GuessArena. The framework comprises two core components: Domain Knowledge Modeling (Left Panel), which parses and models domain-specific documents to generate a candidate card repository for evaluation; and Interactive Reasoning Evaluation (Right Panel), which employs a multi-turn dialogue mechanism to construct an interactive reasoning game, systematically assessing the model's key capability metrics.
  • Figure 3: Cross-domain GuessArena scores (higher is better) for nine LLMs under three prompting strategies. Grouped bars show the composite GuessArena performance achieved with basic, cot, and knowledge-driven prompts in each of the five industry domains, allowing a visual comparison of prompt effectiveness across models and domains.
  • Figure 4: Interactive guessing trajectories in the healthcare scenario. DeepSeek-V3 (left) and Qwen-2.5-32B-Instruct (right) pose sequential yes/no questions to identify the target card, Pharmacologist. Rounded boxes contain model-generated queries; the colored chips denote the oracle’s feedback (green: Yes, red: No, grey: End).
  • Figure 5: Prompt template for deriving domain-specific keywords that seed the GuessArena card deck.
  • ...and 5 more figures