Qworld: Question-Specific Evaluation Criteria for LLMs

Shanghua Gao; Yuchang Su; Pengwei Sui; Curtis Ginder; Marinka Zitnik

Abstract

Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
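The decomposition described in the abstract (question, then scenarios, then perspectives, then binary criteria) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` type, the `propose` callback signature, the `horizontal_rounds` parameter, and the exact-match deduplication are all assumptions made for illustration, and `toy_propose` is a deterministic stand-in for what would presumably be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Levels named in the abstract; treating them as a fixed 4-level tree is an assumption.
LEVELS = ["question", "scenario", "perspective", "criterion"]


@dataclass
class Node:
    level: str
    text: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical callback: (parent node, texts of existing children) -> new child texts.
Expander = Callable[[Node, List[str]], List[str]]


def expand(node: Node, propose: Expander, horizontal_rounds: int = 2) -> Node:
    """Grow the tree below `node` by horizontal, then hierarchical, expansion."""
    depth = LEVELS.index(node.level)
    if depth == len(LEVELS) - 1:
        return node  # criteria are leaves
    child_level = LEVELS[depth + 1]
    seen: List[str] = []
    # Horizontal expansion: repeatedly ask for siblings not yet covered.
    for _ in range(horizontal_rounds):
        added = False
        for text in propose(node, list(seen)):
            if text not in seen:  # exact-match dedup (an assumption)
                seen.append(text)
                node.children.append(Node(child_level, text))
                added = True
        if not added:
            break
    # Hierarchical expansion: recurse into each child.
    for child in node.children:
        expand(child, propose, horizontal_rounds)
    return node


def leaf_criteria(node: Node) -> List[str]:
    """Collect the criterion-level leaves of the tree."""
    if node.level == "criterion":
        return [node.text]
    return [c for child in node.children for c in leaf_criteria(child)]


def toy_propose(parent: Node, existing: List[str]) -> List[str]:
    """Deterministic placeholder for a model call: one new child per round, at most two."""
    k = len(existing)
    if k >= 2:
        return []
    child_level = LEVELS[LEVELS.index(parent.level) + 1]
    return [f"{parent.text} > {child_level} {k + 1}"]


root = expand(Node("question", "Q"), toy_propose)
criteria = leaf_criteria(root)
```

With the toy proposer, each node receives two children, so the tree yields two scenarios, four perspectives, and eight leaf criteria. In practice the proposer would be conditioned on the question and on the siblings already generated, which is what lets horizontal expansion widen coverage rather than repeat itself.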

Paper Structure (34 sections, 7 equations, 15 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Qworld Approach
Problem Formulation
Recursive Expansion Tree for Criteria Generation
Experiments
Experimental Setup
Evaluating the Quality of Criteria Generated by Qworld
Using Qworld-Generated Criteria to Benchmark LLM Capabilities
Ablation, Robustness, and Scaling of Qworld
Conclusion
Implementation Details
Experimental Setup
Criteria Score Calculation
Retrieval-augmented Qworld Implementation
...and 19 more sections

Figures (15)

Figure 1: Given a question, Qworld generates question-specific evaluation criteria, which can be used by downstream evaluators (e.g., LLM-as-judge) to assess responses across diverse contexts.
Figure 2: Recursive expansion tree for generation of evaluation criteria in Qworld (d) in comparison with (a) chain-of-thought, (b) self-reflection generation, and (c) tree-decomposition generation.
Figure 3: Evaluation using a taxonomy grouped from question-level criteria generated by Qworld on HLE. Qworld provides a fine-grained understanding of model responses beyond binary accuracy. The taxonomy for HLE focuses on reasoning qualities, showing the context-awareness of the Qworld method.
Figure 4: Evaluation using a taxonomy grouped from question-level criteria generated by Qworld (b), compared with a human-expert taxonomy (a) on HealthBench.
Figure 5: Human judges rate the value of Qworld unique criteria above 0.90 as the number of criteria increases, indicating that added criteria remain useful rather than redundant.
...and 10 more figures
