No One Size Fits All: QueryBandits for Hallucination Mitigation

Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso

TL;DR

We introduce QueryBandits, a model-agnostic contextual bandit framework that learns online to select the best query-rewrite strategy, guided by an empirically validated and calibrated reward function. The results substantiate the finding that no single rewrite policy is optimal for all queries.

Abstract

Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
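
The online selection loop described in the abstract can be sketched as a linear contextual bandit with Thompson Sampling. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear-Gaussian reward model, the class name `LinearThompsonSampling`, the `simulate` helper, and the synthetic rewards are all hypothetical. Only the arm names (No-Rewrite, Paraphrase, Expand), the 17-dimensional feature vector, and the binary nature of the features come from the text on this page; the full arm set and the real reward function are not shown here.

```python
import numpy as np

# Arm names taken from the abstract; the paper's full arm set is not listed here.
ARMS = ["No-Rewrite", "Paraphrase", "Expand"]
N_FEATURES = 17  # feature dimension quoted in the Figure 1 caption


class LinearThompsonSampling:
    """Hypothetical sketch: one Bayesian linear reward model per rewrite arm."""

    def __init__(self, n_arms, n_features, prior_precision=1.0, noise_scale=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_scale = noise_scale
        # A[a]: posterior precision matrix; b[a]: reward-weighted feature sum.
        self.A = np.stack([prior_precision * np.eye(n_features) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, n_features))

    def select(self, x):
        """Sample a weight vector per arm and pick the arm with the best sampled reward."""
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            cov = np.linalg.inv(A_a)
            cov = (cov + cov.T) / 2.0  # symmetrize against round-off
            mean = cov @ b_a
            theta = self.rng.multivariate_normal(mean, self.noise_scale ** 2 * cov)
            scores.append(theta @ x)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Update only the chosen arm's posterior with the observed reward."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def simulate(n_rounds=200, seed=0):
    """Run the bandit on synthetic queries; rewards here are made up for illustration."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=(len(ARMS), N_FEATURES))  # synthetic, not from the paper
    bandit = LinearThompsonSampling(len(ARMS), N_FEATURES, seed=seed)
    pulls = np.zeros(len(ARMS), dtype=int)
    for _ in range(n_rounds):
        x = rng.integers(0, 2, size=N_FEATURES).astype(float)  # binary query features
        arm = bandit.select(x)
        reward = true_theta[arm] @ x + rng.normal(scale=0.1)
        bandit.update(arm, x, reward)
        pulls[arm] += 1
    return bandit, pulls
```

In a real deployment, the synthetic reward line would be replaced by a call that rewrites the query with the selected arm, sends it to the LLM, and scores the response. Because the loop only needs forward passes and a scalar reward, it works without access to model weights, which is the property the abstract highlights for closed-source models.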

Paper Structure

This paper contains 20 sections, 27 equations, 26 figures, 11 tables, 1 algorithm.

Figures (26)

  • Figure 1: QueryBandits selects a rewrite that fixes a counting error. The original query $x_t$ elicits a hallucinated count ($8$ integers) due to an ambiguous lower bound ($6$). Conditioned on the query's 17-dimensional feature vector, QueryBandits selects Expand and rewrites the query to $x'_t$ with explicit bounds; the LLM then returns the correct cardinality ($9$). Notably, the feature vector also shifts: subordination (more complex clauses) appears while specialization (domain-specific knowledge required) disappears, illustrating how rewriting alters the salient semantics of $x_t$.
  • Figure 2: (Simplex panel) Our chosen $(\alpha,\beta,\gamma)$ lies deep in the 1% optimal frontier. (Rank panel) Breakdown of per‐dataset arm performance: different datasets consistently favor different rewrite strategies.
  • Figure 3: Cumulative Reward (averaged across all runs). Sorted by final performance, highlighting gains achieved by contextual bandits over non‐contextual learners and static rewrites.
  • Figure 4: Contextual Per‐Feature Variance by Arm. For each arm, we compute the variance of each binary linguistic feature over all queries on which that arm was chosen. High variance means the bandit frequently switches the arm on that feature’s presence.
  • Figure 5: Contextual Feature Contribution Strength. These are the averaged $\theta$ weights (direct contributions) per feature to the expected reward under each arm. Positive weights indicate features that boost that arm’s reward; negative weights indicate features that penalize it.
  • ...and 21 more figures
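
The per-arm statistic described in the Figure 4 caption (the variance of each binary feature over the queries on which an arm was chosen) can be computed as in the sketch below. The function name `per_feature_variance_by_arm` and the toy data are hypothetical, and the use of population variance rather than sample variance is an assumption; the caption does not say which the authors use.

```python
import numpy as np


def per_feature_variance_by_arm(features, chosen_arms, n_arms):
    """Variance of each feature over the queries assigned to each arm.

    features:    (n_queries, n_features) array of 0/1 feature indicators
    chosen_arms: (n_queries,) array with the arm index selected for each query
    Returns an (n_arms, n_features) array; rows for never-chosen arms are NaN.
    """
    features = np.asarray(features, dtype=float)
    chosen_arms = np.asarray(chosen_arms)
    out = np.full((n_arms, features.shape[1]), np.nan)
    for arm in range(n_arms):
        rows = features[chosen_arms == arm]
        if len(rows) > 0:
            out[arm] = rows.var(axis=0)  # population variance (assumed)
    return out


# Toy data, not from the paper: four queries, two features, three arms.
toy_features = [[1, 0], [0, 0], [1, 1], [1, 1]]
toy_arms = [0, 0, 1, 1]
variances = per_feature_variance_by_arm(toy_features, toy_arms, n_arms=3)
```

For a binary feature that is present on a fraction $p$ of an arm's queries, this variance equals $p(1-p)$, so it peaks at $0.25$ when the feature is present on exactly half of them and falls to $0$ when the feature is always present or always absent.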

Theorems & Definitions (2)

  • Remark 1
  • Remark 2