Context-Driven Interactive Query Simulations Based on Generative Large Language Models
Björn Engelmann, Timo Breuer, Jana Isabelle Friese, Philipp Schaer, Norbert Fuhr
TL;DR
This work addresses the gap between traditional Cranfield-style IR evaluation and real user behavior by introducing context-driven, interactive simulations that account for the user's knowledge state and session dynamics. It combines two query-generation approaches (LLM-based prompting and Doc2Query with evolving knowledge states) with two retrieval systems (sparse BM25 and the neural ranker MonoT5) to simulate realistic search sessions. The study introduces context-aware evaluation measures (Effort vs. Effect, sDCG, sRBP) and provides a detailed implementation and datasets. The results show that context and feedback substantially improve information gain, with probabilistic prompt strategies outperforming rule-based ones. The open experimental setup and the multiple analysis perspectives offer a practical path toward higher-fidelity user simulations and more informative IR evaluations. The findings are relevant to researchers and developers who want to benchmark IR systems in user-centric settings and to understand the trade-off between retrieval effectiveness and user effort.
Abstract
Simulating user interactions enables a more user-oriented evaluation of information retrieval (IR) systems. While user simulations are cost-efficient and reproducible, many approaches lack fidelity to real user behavior. Most notably, current user models neglect the user's context, which is the primary driver of perceived relevance and of interactions with the search results. To address this, this work introduces the simulation of context-driven query reformulations. The proposed query generation methods build upon recent Large Language Model (LLM) approaches and consider the user's context throughout the simulation of a search session. Compared to simple context-free query generation approaches, these methods are more effective and allow the simulation of more efficient IR sessions. Similarly, our evaluations consider more interaction context than current session-based measures and reveal insights that complement the established evaluation protocols. We conclude with directions for future work and provide an entirely open experimental setup.
