Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding

TL;DR

This work tackles the challenge of making LLMs proactive in high-stakes domains by bridging the reality gap without user simulators. It introduces Learn-to-Ask, a simulator-free offline RL framework that learns long-horizon dialogue policies directly from real expert logs by grounding per-turn rewards in the observed future via ground-truth targets $I^*_t$ and $s^*_t$. Rewards are decomposed into micro (question utility) and macro (assessment stopping) components and integrated through a hierarchical fusion, with automated prompt calibration (Auto-Prompt) ensuring fidelity of extraction and grading. The approach is validated offline on RealMedConv and deployed in a live production medical AI service, where the learned policies achieved near-superhuman task success and meaningful business impact, including large gains in information gathering and dialog-to-purchase conversions. The work provides a practical blueprint for transforming passive LLMs into goal-directed agents, while outlining theoretical connections to offline RL, causal inference, and information-gain frameworks, and pointing to future directions for more autonomous, safe, and scalable proactive AI systems.

Abstract

Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask on a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. After launch, our model achieved performance superior even to that of human experts in rigorous in-house evaluations, demonstrating our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
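
To make the "observed future" idea concrete, the following is a minimal sketch of how one expert dialogue could be turned into per-turn supervised targets. It is an illustration under stated assumptions, not the authors' pipeline: the `Turn` and `TurnTarget` types, the `info_gained` field, and the rule "stop once nothing new remains to collect" are all hypothetical stand-ins for the targets $I^*_t$ and $s^*_t$, and in practice the information items would have to be extracted from raw text rather than given as sets.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Turn:
    question: str          # what the expert asked at this turn
    info_gained: Set[str]  # hypothetical: facts revealed by the user's reply


@dataclass
class TurnTarget:
    context_len: int       # number of past turns visible to the policy
    target_info: Set[str]  # stand-in for I*_t: info the expert still goes on to collect
    should_stop: bool      # stand-in for s*_t: True once nothing remains to collect


def targets_from_log(dialogue: List[Turn]) -> List[TurnTarget]:
    """Derive one target per decision point from the observed future of a log."""
    targets = []
    # There is one decision point before each turn, plus one after the last turn.
    for t in range(len(dialogue) + 1):
        known = set().union(*(turn.info_gained for turn in dialogue[:t]))
        future = set().union(*(turn.info_gained for turn in dialogue[t:]))
        remaining = future - known  # only information not already gathered
        targets.append(TurnTarget(t, remaining, not remaining))
    return targets


log = [
    Turn("Do you have a fever?", {"fever"}),
    Turn("How long has it lasted?", {"duration"}),
]
for target in targets_from_log(log):
    print(target.context_len, sorted(target.target_info), target.should_stop)
```

Each resulting target can then be treated as an independent supervised example for the (action, state_assessment) output, which is the sense in which the long-horizon problem is decomposed turn by turn.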

Paper Structure

This paper contains 63 sections, 7 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed Learn-to-Ask framework, which transforms the intractable offline RL problem into a sequence of tractable supervised learning tasks.
  • Figure 2: A case study comparing dialogues generated by SFT and Learn-to-Ask models.
  • Figure 3: An illustrative example of the conceptual graph model.
  • Figure 4: Reward curves of the RL algorithms when training the 7B (left) and 32B (right) models.
  • Figure 5: Evaluation results of our 7B and 32B models on general-capability benchmarks.