What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance

William Watson; Nicole Cho; Sumitra Ganesh; Manuela Veloso

TL;DR

The authors argue that a query's form can shape a listener's (and a model's) response. They establish an empirically observable query-feature representation that correlates with hallucination risk, paving the way for guided query rewriting and future intervention studies.

Abstract

Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": features such as deep clause nesting and underspecification align with higher hallucination propensity, whereas clear intention grounding and answerability align with lower hallucination rates. Other features, including domain specificity, show mixed, dataset- and model-dependent effects. These findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.

Paper Structure (36 sections, 8 equations, 14 figures, 9 tables)

Introduction
Related Work
Methodology
Linguistic features
Observed risk via semantics-preserving perturbations
Ordinal risk model
Metrics and diagnostics
Robustness
Experimental Setup
Results: A Query-Feature Risk Landscape for Hallucination
Feature and Dataset Effects
Distributional Effects
Task Format Moderates Absolute Risk But Not Direction
Propensity overlap & uplifts
Robustness Across Datasets
...and 21 more sections

Figures (14)

Figure 1: Prevalence of binary linguistic features across hallucination risk categories (Safe, Borderline, Risky). Warmer colors indicate higher frequency. Lack of specificity, clause complexity, and polysemous words show a pronounced rise from Safe to Risky.
Figure 2: ECDFs of predicted $P(\text{Risky})$ for Present vs. Absent (top six by KS). Shaded regions indicate dominance; inset shows KS and $\Delta$median. Lack of Specificity, Excessive Details, Clause Complexity, and Query–Scenario Mismatch shift mass toward higher risk; Answerability and Intention Grounding shift mass lower.
Figure 3: Risk vs. query length by scenario. Each curve shows the empirical probability of a risky output (fraction of "Risky" labels) after quantile-binning query length within a scenario ($\geq50$ examples per bin). Risk rises with length for Abstractive, remains low/flat for Extractive, and is intermediate for Multiple-Choice. Takeaway: longer, open-ended queries are more hallucination-prone, while extractive settings remain robust across lengths.
Figure 4: Feature coefficients $\beta$ (left) and dataset/scenario fixed effects $\alpha,\gamma$ (right) from the ordinal logit model. Positive values increase log-odds of Risky. Answerability is strongly protective; Lack of Specificity, Negation, and Anaphora increase risk.
Figure 5: LODO coefficient stability (ordinal logit). Each point is a feature coefficient estimated when one dataset is held out (color = held-out dataset’s scenario), with short horizontal bars showing the mean and $\pm$1 s.d. across LODO runs. The blue diamond is the pooled (full-fit) coefficient. Signs and magnitudes are stable: Lack of Specificity, Clause Complexity, Query–Scenario Mismatch remain risk-increasing, while Answerability remains strongly protective.
...and 9 more figures
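
The Figure 3 caption describes a concrete procedure: quantile-bin query length within each scenario, keep bins with at least 50 examples, and report the fraction of "Risky" labels per bin. The sketch below is one way that computation could look, not the authors' code. The function name `risk_by_length_bin`, the `(scenario, query_length, label)` record layout, and the default of four bins are assumptions made for illustration; only the per-scenario quantile binning, the 50-example minimum, and the "Risky" label come from the caption.

```python
from collections import defaultdict


def risk_by_length_bin(records, n_bins=4, min_per_bin=50):
    """Empirical P(Risky) per query-length quantile bin, per scenario.

    records: iterable of (scenario, query_length, label) tuples
             (hypothetical layout; label is e.g. "Safe", "Borderline", "Risky").
    Returns {scenario: [(min_len, max_len, n_examples, risky_fraction), ...]}.
    """
    # Group examples by scenario so that bins are computed within each one.
    by_scenario = defaultdict(list)
    for scenario, length, label in records:
        by_scenario[scenario].append((length, label))

    curves = {}
    for scenario, rows in by_scenario.items():
        # Rank-based quantile bins: sort by length, then cut into
        # n_bins chunks of (nearly) equal size. Ties in length may
        # straddle a bin boundary, which is a simplification.
        rows.sort(key=lambda r: r[0])
        n = len(rows)
        points = []
        for b in range(n_bins):
            chunk = rows[b * n // n_bins:(b + 1) * n // n_bins]
            if len(chunk) < min_per_bin:
                continue  # skip sparse bins, mirroring the >=50 rule
            risky = sum(1 for _, label in chunk if label == "Risky")
            points.append((chunk[0][0], chunk[-1][0], len(chunk), risky / len(chunk)))
        curves[scenario] = points
    return curves
```

Plotting `risky_fraction` against the bin's length range for each scenario would give one curve per scenario, which is the shape of comparison the caption reports. A scenario with too few examples to fill any bin simply yields an empty curve.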
