On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots

Christine Herlihy; Jennifer Neville; Tobias Schnabel; Adith Swaminathan

TL;DR

The paper addresses miscalibrated conversational priors in LLM-based chatbots facing under-specified user queries by modeling interactions as a Partially Observed Decision Process (PODP) and showing how current RLHF-tuned policies fail to trade off uncertainty reduction against utility. It introduces two prompting-based interventions that recalibrate responses without fine-tuning: data-agnostic prompts (CoT and Clarify-Flex) and a data-based meta-policy learned from logged conversations. Across synthetic recommendation tasks, these approaches improve expected utility for under-specified queries, with Clarify-Flex and the learned meta-policy providing notable gains and converging to baseline as queries become well-specified. The results demonstrate a practical, data-efficient path to re-align LLM chatbots with user goals in multi-turn interactions, using only black-box LLM access and historical logs.

Abstract

We explore the use of Large Language Model (LLM)-based chatbots to power recommender systems. We observe that the chatbots respond poorly when they encounter under-specified requests (e.g., they make incorrect assumptions, hedge with a long response, or refuse to answer). We conjecture that such miscalibrated response tendencies (i.e., conversational priors) can be attributed to LLM fine-tuning using annotators: single-turn annotations may not capture multi-turn conversation utility, and the annotators' preferences may not even be representative of users interacting with a recommender system. We first analyze public LLM chat logs to conclude that query under-specification is common. Next, we study synthetic recommendation problems with configurable latent item utilities and frame them as Partially Observed Decision Processes (PODPs). We find that pre-trained LLMs can be sub-optimal for PODPs and derive better policies that clarify under-specified queries when appropriate. Then, we re-calibrate LLMs by prompting them with learned control messages to approximate the improved policy. Finally, we show empirically that our lightweight learning approach effectively uses logged conversation data to re-calibrate the response strategies of LLM-based chatbots for recommendation tasks.
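The trade-off the abstract describes (answer directly, hedge with a long response, or clarify first) can be illustrated with a toy calculation. The sketch below is not the paper's PODP formulation: it assumes a uniform prior over `num_intents` equally plausible user intents, a usefulness of 1 for a correct answer and 0 otherwise, and two hypothetical cost parameters, `clarify_cost` and `read_cost_per_option`, standing in for the user's cognitive cost. All names and default values are assumptions chosen for illustration.

```python
def expected_utilities(num_intents, clarify_cost, read_cost_per_option):
    """Toy expected utility of three response strategies.

    Assumes (hypothetically) a uniform prior over `num_intents` user intents
    and a usefulness of 1.0 when the answer matches the true intent.
    """
    # Respond: commit to one guessed intent; correct with probability 1/K.
    respond = 1.0 / num_intents
    # Hedge: cover every intent, so the answer is always useful, but the
    # user pays a reading cost for each option listed.
    hedge = 1.0 - read_cost_per_option * num_intents
    # Clarify: ask one question that resolves the intent, then answer
    # correctly; the user pays a fixed cost for the extra turn.
    clarify = 1.0 - clarify_cost
    return {"respond": respond, "hedge": hedge, "clarify": clarify}


def best_strategy(num_intents, clarify_cost=0.2, read_cost_per_option=0.1):
    """Return the strategy name with the highest toy expected utility."""
    utilities = expected_utilities(num_intents, clarify_cost, read_cost_per_option)
    return max(utilities, key=utilities.get)


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, best_strategy(k), expected_utilities(k, 0.2, 0.1))
```

Under these assumed costs, a fully specified query (one plausible intent) favors responding directly, while a query with several plausible intents favors clarifying. This mirrors the qualitative claim that clarification helps most when queries are under-specified, and nothing more than that.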

Paper Structure (39 sections, 8 equations, 16 figures, 9 tables, 1 algorithm)

Introduction
Problem Formulation
Policies Induced By Prompting LLMs
Query Under-specification Causes Sub-optimal Interactions
Meta-Policies
Characterizing and Inducing Chatbot Response Behaviors
Motivating Experiments
Query Underspecification is Common
LLM Policies Can Be Sub-optimal When Queries are Under-specified
Synthetic query corpus
Sub-optimality of LLM in single-step interaction
Sub-optimality of LLM in multi-step interactions
Algorithmic Approach
Data-Agnostic Interventions
Approach 1: Chain of Thought (CoT).
...and 24 more sections

Figures (16)

Figure 1: An example failure where a user's query is under-specified (blue text). Current LLM-based chatbots produce long responses in order to hedge against uncertainty (purple text). Clarifying the user's context can avert this failure.
Figure 2: Real-world users asked severely under-specified queries more than $23\%$ of the time in the OpenAssistant dataset ($n = 600$).
Figure 3: For a user query such as $q$: "Is my outfit appropriate for the event I'm attending tonight?" an LLM-based chatbot can choose different response strategies. These strategies produce responses that differ in their cognitive costs (x-axis) while providing final answers with different, user-specific levels of usefulness (y-axis). A good chatbot should respond so as to maximize overall utility, i.e., by providing useful and low-cost answers for the user.
Figure 4: PODP plate diagram illustrating user-chatbot interactions, prompt-induced policies ($\pi^p$), and the meta-policy mapping from conversations to prompts ($\beta$).
Figure 5: Spectrum characterizing the response strategies that an LLM-based chatbot can take. RLHF fine-tuning encourages Respond and Hedge, whereas Clarify may be more appropriate when user queries are under-specified.
...and 11 more figures

Theorems & Definitions (1)

Definition 2.1: Under-specification
