Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam

Michael Majurski, Cynthia Matuszek

TL;DR

This work investigates how the quality of background grounding information in a model's context window affects accuracy, and finds that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains.

Abstract

How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using gpt-oss-20b to rewrite a subset of Humanity's Last Exam with answer-free grounding context improves gpt-5-mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered through prompting at inference time alone; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
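The two-phase structure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LM` type, the function names, and the prompt wording are all hypothetical, and any callable that maps a prompt string to a completion string can stand in for the rewriting and answering models.

```python
from typing import Callable

# Hypothetical stand-in: any function mapping a prompt string to a completion.
LM = Callable[[str], str]


def rewrite_question(rewriter: LM, question: str, context: str) -> str:
    """Phase 1: disambiguate the question using answer-free context."""
    # Assumed prompt wording; the paper's actual prompt is not shown here.
    prompt = (
        "Rewrite the question so it is unambiguous, using the background "
        "context. Do not answer it and do not change its correct answer.\n\n"
        f"Context:\n{context}\n\nQuestion:\n{question}\n\nRewritten question:"
    )
    return rewriter(prompt).strip()


def answer_question(answerer: LM, question: str) -> str:
    """Phase 2: the answering model sees only the question, never the context."""
    return answerer(f"Question:\n{question}\n\nAnswer:").strip()


def rewrite_then_answer(
    rewriter: LM, answerer: LM, question: str, context: str
) -> str:
    """Run the two phases in sequence, withholding the context in phase 2."""
    rewritten = rewrite_question(rewriter, question, context)
    return answer_question(answerer, rewritten)
```

Keeping the phases in separate functions makes the key property easy to check: the context is passed only to the rewriter, so the answering model receives the rewritten question and nothing else.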

Paper Structure (21 sections, 14 figures, 2 tables)

Figures (14)

  • Figure 1: When RAG systems surface relevant information, LM performance can be enhanced by rewriting the initial query using context---added information that, without providing the answer, gives relevant background knowledge and direction.
  • Figure 2: Rewriting questions using answer-free grounding context yields significant accuracy improvement (distance above line) over the original questions, as evaluated on a subset of Humanity's Last Exam (HLE). See the figure labeled fig:all-scatterplots for a complete legend.
  • Figure 3: The quality of context presented to an LM has an impact on question-answering performance. Accuracy is highest when RAG systems surface context containing the answer (cyan), but when the question is presented without context (orange) or the surfaced information does not contain the answer (green), performance (without query rewriting) suffers. The act of interpretation during question rewriting (pink) produces an accuracy improvement beyond simply prepending the same answer-free context (AFC) to the question, even though the AFC is not included when the LM is asked the rewritten question.
  • Figure 4: Performance of various LMs with and without answer-containing context and answer-free context. The x-axis shows the original question benchmark accuracy while the y-axis shows the rewritten question benchmark accuracy; distance above the line conveys improvement. Colors represent the different models tested (see the table labeled tab:model_cutoff) and shapes represent the dataset (see the table labeled tab:dataset_publication_cutoff). (top left) Answers are present in the context (baseline): Benchmark performance significantly improves with the addition of context that contains the answer. (top right) Answers not present in the context: Simply providing relevant answer-free context without question rewriting does not improve performance. (bottom left) Questions rewritten using answer-free context (our approach): Rewriting the question using only AFC improves accuracy, even though only the rewritten question is presented to the LM during benchmarking (AFC is withheld).
  • Figure 5: Per-dataset per-model difference in benchmark accuracy between the rewritten questions and the original questions with associated answer-free context. The violin plot distribution highlights the range of accuracy deltas over all datasets for each model evaluated. Benchmark accuracy improved by an average of 0.1346.
  • ...and 9 more figures
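
The accuracy deltas summarized in the Figure 5 caption are differences between rewritten-question and original-question accuracy for each model and dataset pair. A minimal sketch of that bookkeeping is below. The function name and the dictionary layout are assumptions for illustration, and the only numbers used are the gpt-5-mini figures from the abstract (0.14 to 0.37); no per-dataset values from the paper are reproduced.

```python
from statistics import mean


def accuracy_deltas(results):
    """Map each (model, dataset) key to rewritten accuracy minus original accuracy.

    `results` is a hypothetical layout: {(model, dataset): (original, rewritten)}.
    """
    return {
        key: rewritten - original
        for key, (original, rewritten) in results.items()
    }


# The only pair taken from the text: gpt-5-mini on the HLE subset (abstract).
results = {("gpt-5-mini", "HLE subset"): (0.14, 0.37)}

deltas = accuracy_deltas(results)
mean_delta = mean(list(deltas.values()))  # average absolute gain over all pairs
relative_gain = 0.37 / 0.14               # ratio of rewritten to original accuracy

print(round(mean_delta, 2), round(relative_gain, 2))
```

For the single abstract example, the absolute gain is about 0.23 and the rewritten accuracy is about 2.6 times the original, which is the sense in which the title speaks of doubling performance.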