Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models
Anthony Sicilia, Hyunwoo Kim, Khyathi Raghavi Chandu, Malihe Alikhani, Jack Hessel
TL;DR
This work introduces FortUne Dial, a framework and benchmark for forecasting uncertainty in conversations. It treats dialogue outcomes as probabilistic events and evaluates forecasts with uncertainty-aware metrics such as the Brier Score and a new skill score. It compares two forecasting paradigms: Implicit Forecasts, which extract probabilities from token distributions, and Direct Forecasts, which parse probabilities from natural language outputs. It proposes both supervised and reinforcement-learning-driven fine-tuning to calibrate these representations. On eight difficult negotiation datasets, the authors demonstrate that these uncertainty-tuning strategies enable small open-source models to rival or surpass pre-trained models many times larger, especially when combined with post-hoc inference corrections and priors. The study also analyzes biases, the impact of model and data scale, and the relative benefits of exploration versus exploitation, offering practical guidance for deploying calibrated uncertainty forecasts in socially sensitive interactive settings. The authors release code, models, and data to spur further research in calibrated dialogue forecasting and negotiation analysis.
Abstract
Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing "conversation forecasting" task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores, and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.
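
To make the idea of an uncertainty-aware metric concrete, the following minimal sketch computes the Brier Score for binary outcome forecasts, together with the standard Brier skill score measured against a base-rate reference. This is an illustration only: the function names are hypothetical, the example data are made up, and the skill score shown is the textbook form, which may differ from the skill score proposed in this work.

```python
from typing import Optional, Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


def brier_skill_score(
    forecasts: Sequence[float],
    outcomes: Sequence[int],
    reference: Optional[Sequence[float]] = None,
) -> float:
    """Textbook skill score: 1 - BS / BS_ref (assumed form, not the paper's).

    If no reference forecast is given, the empirical base rate of the
    outcomes is used as a constant reference forecast.
    """
    if reference is None:
        base_rate = sum(outcomes) / len(outcomes)
        reference = [base_rate] * len(outcomes)
    bs_ref = brier_score(reference, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - brier_score(forecasts, outcomes) / bs_ref


# Made-up example: 1 means the negotiation ended in a deal, 0 means it did not.
outcomes = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # forecasts that commit to an outcome
hedged = [0.5, 0.5, 0.5, 0.5]     # "who knows" on every conversation

bs_confident = brier_score(confident, outcomes)
bs_hedged = brier_score(hedged, outcomes)
skill_confident = brier_skill_score(confident, outcomes)
skill_hedged = brier_skill_score(hedged, outcomes)
```

Lower Brier Scores are better, and a positive skill score means the forecaster beats the reference. In this toy example the confident forecaster is rewarded because it is also right, while always answering 0.5 scores worse than simply predicting the base rate. A forecast of 0.5 on a single hard instance costs only 0.25 whatever the outcome, which is the sense in which such metrics allow a form of abstention on individual instances.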
