Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing

William Watson; Nicole Cho; Nishan Srishankar

TL;DR

HalluciBot addresses LLM hallucination by shifting focus to pre-generation query quality: it estimates the probability that a given query will induce hallucinations without invoking a generator during inference. It does so through a Multi-Agent Monte Carlo pipeline that perturbs each query ($n=5$ perturbations, yielding $n+1=6$ variants) and aggregates over $2{,}219{,}022$ outputs to obtain $p_h(q_0)$, which is then learned by a RoBERTa-based encoder with dual heads for hallucination probability and consensus. The authors demonstrate that HalluciBot enables four downstream modes, Ratiocinate, Rewrite, Rank, and Route (H4R), which improve query quality, reduce hallucinations, and route queries across pipelines. Reported results include a $76.0\%$ test F1 for binary hallucination detection, $46.6\%$ saved computation on hallucinatory queries, and positive class transitions from hallucinatory to non-hallucinatory of $+30.2\%$ for rewriting and $+50.6\%$ for ranking. The work lays a foundation for query-centric mitigation strategies that can generalize to RAG-enabled, few-shot, or API-restricted systems, although the Monte Carlo simulation is computationally costly and the labels it produces may be noisy.

Abstract

Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates the query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs. Moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation stems from our ablation studies, which measure an increase in output diversity (+12.5 agreement spread) when a query is perturbed in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
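
The labeling step described above can be illustrated with a minimal sketch. It assumes that an answer counts as a hallucination when it fails a case-insensitive exact match against the ground truth; the function names, the sample answers, and the matching rule are hypothetical and are not taken from the paper. The binary label follows the figure captions, where a query is labeled hallucinatory if at least one of its sampled outputs is wrong.

```python
from typing import Sequence

N_PERTURBATIONS = 5        # n, as stated in the TL;DR
N_QUERIES = 369_837        # number of queries, as stated in the abstract


def hallucination_rate(answers: Sequence[str], truth: str) -> float:
    """Monte Carlo estimate: fraction of sampled answers that miss the truth.

    Assumption: correctness is a case-insensitive exact match. The paper's
    actual scoring rule may differ per scenario.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    target = truth.strip().lower()
    wrong = sum(a.strip().lower() != target for a in answers)
    return wrong / len(answers)


def binary_label(answers: Sequence[str], truth: str) -> int:
    """Return 1 if at least one sampled answer hallucinated, else 0."""
    return int(hallucination_rate(answers, truth) > 0)


# One output per variant: the original query plus n perturbations.
total_outputs = N_QUERIES * (N_PERTURBATIONS + 1)

# Hypothetical outputs from the n + 1 = 6 agents for a single query.
answers = ["Paris", "paris", "Lyon", "Paris", "Paris", "Marseille"]
p_h = hallucination_rate(answers, "Paris")
label = binary_label(answers, "Paris")
```

Here `total_outputs` matches the 2,219,022 sampled outputs quoted in the TL;DR, since 369,837 × 6 = 2,219,022. For the hypothetical answers, two of the six miss the ground truth, so the estimated rate is 2/6 and the binary label is 1.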

Paper Structure (56 sections, 23 equations, 8 figures, 19 tables)


Introduction
Related Work
Methodology Overview
Multi-Agent Monte Carlo Simulation
Converting Monte Carlo Estimates To Labels
How To Train a Classifier?
Experimental Setup
Analysis & Discussion
Conclusion
Definitions
What is Extractive QA?
What is Multiple Choice QA?
What is Abstractive QA?
What is Temperature and Nucleus Sampling?
What is a Multi-Agent Simulation?
...and 41 more sections

Figures (8)

Figure 1: Comparison of traditional inference methods and HalluciBot's use-cases. In the former, the user inputs a query through either direct inference or a retrieval-augmented generation (RAG) pipeline. If the output is hallucinatory, the user must decide whether to end the session or revise the query for further generation rounds. In contrast, HalluciBot can be used to assess the query's quality before generation. Users can therefore gain insight into the hallucination risk ("Ratiocinate"), automate the query rewriting stage through informed feedback ("Rewrite") or Best-of-N sampling across multiple candidates ("Rank"), and route the query across different operating modes ("Route"). Because HalluciBot is scenario-aware (Extractive / Abstractive), routing can potentially bypass computationally expensive stages, such as RAG or Rewrite.
Figure 2: Training Overview. A single query, $q_0$, is perturbed in $n$ different ways. Next, the original and perturbed queries $q_i$ are independently answered by the Generator agents. This Multi-Agent Monte Carlo simulation provides an estimate of the rate of hallucination $p_h(q_0)$ for the original query $q_0$. From these simulated results, HalluciBot is trained to predict the probability that any $q_0$ could hallucinate, and to predict the expected consensus of sampled outputs before generation.
Figure 3: Distribution of the observed number of hallucinations per scenario. For Extractive, additional context mitigates the rate of hallucination. For Multiple Choice, distractors can cause confusion amongst agents uniformly. For Abstractive, however, the absence of additional information can cause massive disparities in correctness: most of our simulations resulted in either no hallucinations or all hallucinations.
Figure 4: Binary distribution of labels, where at least one hallucination occurred during our simulation.
Figure 5: Top: HalluciBot calibration curves with Brier Scores (BS), alongside the histogram of predicted probabilities. Bottom: Predicted hallucination labels juxtaposed against observed hallucination rates during our Monte Carlo simulation, with calibrated matrix below. We highlight $1$-$6$ as corresponding to the binary label "Yes - Hallucinatory" ($y=1$) during training. Notably, there is significant confusion for borderline queries ($1$, $2$) rather than for queries where the majority of outputs hallucinate ($3$-$6$).
...and 3 more figures
