Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Shuhang Xu, Weijian Deng, Yixuan Zhou, Fangwei Zhong

TL;DR

CK-Arena is a game-based benchmark that assesses how well LLMs understand conceptual knowledge boundaries by placing them in an interactive, Undercover-style multi-agent game. The framework includes two modes, a civilian/undercover dynamic and an Undercover-Audience variant. Judges score each statement on novelty, relevance, and reasonableness, and a data collection pipeline records the interactions. Experimental results across six models show that conceptual understanding varies by category and is not strictly aligned with model size, which motivates concept-focused evaluation beyond raw scale. The work provides a dataset of 529 concept pairs, formal metrics, and an automated evaluation process, offering a scalable way to study and improve concept-aware reasoning in LLMs, with room for future multilingual expansion and broader concept coverage.

Abstract

Concepts represent generalized abstractions that enable humans to categorize and reason efficiently, yet it is unclear to what extent Large Language Models (LLMs) comprehend these semantic relationships. Existing benchmarks typically focus on factual recall and isolated tasks, failing to evaluate the ability of LLMs to understand conceptual boundaries. To address this gap, we introduce CK-Arena, a multi-agent interaction game built upon the Undercover game, designed to evaluate the capacity of LLMs to reason with concepts in interactive settings. CK-Arena challenges models to describe, differentiate, and infer conceptual boundaries based on partial information, encouraging models to explore commonalities and distinctions between closely related concepts. By simulating real-world interaction, CK-Arena provides a scalable and realistic benchmark for assessing conceptual reasoning in dynamic environments. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across different categories and is not strictly aligned with parameter size or general model capabilities. The data and code are available at the project homepage: https://ck-arena.site.
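
The game mechanics described above can be illustrated with a minimal sketch. It assumes the standard Undercover rules (most players receive one concept, a minority receive a closely related one, and players vote to eliminate a suspect after each round of descriptions). The function names, the player labels, the concept pair, and the tie-handling rule are hypothetical and are not taken from the CK-Arena code.

```python
import random
from collections import Counter


def assign_roles(players, concept_pair, n_undercover=1, seed=0):
    """Give most players the civilian concept and a few the undercover concept.

    concept_pair is (civilian_concept, undercover_concept); both are
    illustrative placeholders, not entries from the paper's dataset.
    """
    rng = random.Random(seed)
    civilian_concept, undercover_concept = concept_pair
    undercover = set(rng.sample(players, n_undercover))
    return {
        p: ("undercover", undercover_concept) if p in undercover
        else ("civilian", civilian_concept)
        for p in players
    }


def tally_votes(votes):
    """Return the most-voted player, or None if there are no votes or a tie.

    votes maps each voter to the player they suspect. Treating a tie as
    "nobody eliminated" is an assumption made for this sketch.
    """
    counts = Counter(votes.values()).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


if __name__ == "__main__":
    players = ["p1", "p2", "p3", "p4", "p5", "p6"]
    roles = assign_roles(players, ("violin", "cello"), n_undercover=2, seed=3)
    for player, (role, concept) in roles.items():
        print(player, role, concept)
```

In a full game, each player would generate a description of its concept between these two steps. The close relation between the two concepts is what makes the descriptions informative about conceptual boundaries.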

Paper Structure

This paper contains 33 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Conceptual knowledge arena (CK-Arena). A benchmark designed to evaluate the ability of Large Language Models (LLMs) to understand and reason with conceptual knowledge boundaries. Built upon the interactive game Undercover, CK-Arena challenges LLMs to take on roles as players and judges, navigating concept pairs that share both commonalities and unique distinctions. Through multi-agent interaction, LLMs generate descriptive statements, reason about semantic similarities and differences, and make strategic decisions based on partial information. Judges evaluate these interactions based on metrics such as novelty, relevance, and reasonableness, providing insights into the LLMs’ conceptual reasoning capabilities in realistic, dynamic environments.
  • Figure 2: t-SNE visualization of all embedded statements in the Tools category for GPT-4o and Gemini-2.0-pro-exp. Gemini-2.0-pro-exp's statements are more widely spread, while GPT-4o's are more concentrated. This suggests that Gemini-2.0-pro-exp covers a broader range of conceptual knowledge, which may indirectly reflect a deeper understanding of the concepts.
  • Figure 3: Win rates of six LLMs across 11 categories. Each model exhibits distinct strengths and weaknesses across concept categories. These variations are likely influenced by differences in training data, architectural design, and optimization strategies specific to each model. The comparison highlights each model's focus areas and knowledge gaps and suggests directions for improving conceptual reasoning.
  • Figure 4: Relevance scores of different LLMs across categories. In this heatmap, darker colors indicate higher scores, showing how closely each LLM's descriptions relate to the target concepts in each category.
  • Figure 5: The t-SNE visualization of all embedded statements in the Animals category for Gemini-2.0-pro-exp and Claude-3-5-Haiku-20241022.
  • ...and 9 more figures
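
A per-category win rate like the one plotted in Figure 3 can be computed from game logs as in the sketch below. It assumes the simplest definition, wins divided by games played for each (model, category) pair. The record layout and the model names are hypothetical, and the paper may define win rate differently (for example, separately per role).

```python
from collections import defaultdict


def win_rates(records):
    """Compute wins / games for every (model, category) pair.

    records is an iterable of (model, category, won) tuples, where won is
    a bool. This layout is an assumption made for illustration only.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for model, category, won in records:
        games[(model, category)] += 1
        wins[(model, category)] += int(won)
    return {key: wins[key] / games[key] for key in games}


if __name__ == "__main__":
    # Made-up records, not results from the paper.
    log = [
        ("model_a", "Tools", True),
        ("model_a", "Tools", False),
        ("model_b", "Tools", True),
        ("model_a", "Animals", True),
    ]
    for (model, category), rate in sorted(win_rates(log).items()):
        print(model, category, rate)
```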