Deal, or no deal (or who knows)? Forecasting Uncertainty in Conversations using Large Language Models

Anthony Sicilia, Hyunwoo Kim, Khyathi Raghavi Chandu, Malihe Alikhani, Jack Hessel

TL;DR

This work introduces FortUne Dial, a framework and benchmark for forecasting uncertainty in conversations. It treats dialogue outcomes as probabilistic events and evaluates forecasts with uncertainty-aware metrics such as the Brier Score and a new skill score. It compares two forecasting paradigms (Implicit Forecasts, which extract probabilities from token distributions, and Direct Forecasts, which parse probabilities from natural language outputs) and proposes both supervised and reinforcement-learning-driven fine-tuning to calibrate these representations. Across eight difficult negotiation datasets, the authors demonstrate that uncertainty-tuning strategies enable small open-source models to rival or surpass pre-trained models many times larger, especially when combined with post-hoc inference corrections and priors. The study also analyzes biases, the impact of model and data scale, and the relative benefits of exploration versus exploitation, offering practical guidance for deploying calibrated uncertainty forecasts in socially sensitive interactive settings. The authors release code, models, and data to spur further research in calibrated dialogue forecasting and negotiation analysis.
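
To make the metrics mentioned above concrete, the following is a minimal sketch of the Brier Score for binary outcomes together with the conventional Brier skill score, which compares a forecaster against a constant reference forecast. It is an illustration only: the function names are hypothetical, and the skill score shown is the textbook form, which may differ from the new skill score the paper proposes.

```python
from typing import Optional, Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


def brier_skill_score(
    forecasts: Sequence[float],
    outcomes: Sequence[int],
    reference: Optional[float] = None,
) -> float:
    """Textbook skill score: 1 - BS(model) / BS(constant reference forecast).

    Assumption: when no reference is given, the empirical base rate of the
    outcomes is used. This is not necessarily the paper's definition.
    """
    if reference is None:
        reference = sum(outcomes) / len(outcomes)
    reference_bs = brier_score([reference] * len(outcomes), outcomes)
    if reference_bs == 0:
        raise ValueError("reference forecast is already perfect; skill is undefined")
    return 1.0 - brier_score(forecasts, outcomes) / reference_bs


# Toy data (made up for illustration): 1 = deal reached, 0 = no deal.
toy_outcomes = [1, 0, 1, 0]
toy_forecasts = [0.9, 0.2, 0.7, 0.4]
toy_bs = brier_score(toy_forecasts, toy_outcomes)
toy_skill = brier_skill_score(toy_forecasts, toy_outcomes)
```

A skill score above zero means the forecaster beats the constant reference, zero means it matches it, and a negative value means it does worse. A lower Brier Score is better, and an uninformative forecast of 0.5 on every instance scores 0.25.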

Abstract

Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing "conversation forecasting" task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size.
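
The two representations of uncertainty described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `implicit_forecast` assumes the model's scores for an affirmative and a negative answer token are already available as two numbers, and `direct_forecast` assumes the model writes a probability or percentage somewhere in its reply. Both function names and the parsing rule are hypothetical.

```python
import math
import re
from typing import Optional


def implicit_forecast(score_yes: float, score_no: float) -> float:
    """Turn two answer-token scores (logits) into P(yes) with a two-way softmax."""
    shift = max(score_yes, score_no)  # subtract the max for numerical stability
    exp_yes = math.exp(score_yes - shift)
    exp_no = math.exp(score_no - shift)
    return exp_yes / (exp_yes + exp_no)


# First number in the text, optionally followed by a percent sign.
_NUMBER = re.compile(r"(\d+(?:\.\d+)?)\s*(%?)")


def direct_forecast(reply: str) -> Optional[float]:
    """Parse a probability from generated text; return None if none is usable."""
    match = _NUMBER.search(reply)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0
    # Reject values that cannot be probabilities rather than guessing.
    if not 0.0 <= value <= 1.0:
        return None
    return value
```

For example, `implicit_forecast(2.0, 0.0)` gives roughly 0.88, while `direct_forecast("I'd say 70% likely")` gives 0.7. Returning `None` for unparseable replies keeps parsing failures visible instead of silently mapping them to a default probability.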

Paper Structure (58 sections, 21 equations, 2 figures, 7 tables)

This paper contains 58 sections, 21 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: FortUne Dial tests the ability of language models to represent uncertainty about future conversation outcomes. To meet this task, we tune models to express uncertainty directly in their output tokens or implicitly in their score distributions. We also provide additional strategies to correct uncertainty at inference time. We propose tasks across 8 datasets and experiment with GPT-4, Llama-2, and Zephyr-style models, releasing our best-performing models publicly.
  • Figure 2: Examples of model forecasts for the eventual occurrence of a personal attack. Models receive priors from data (§ \ref{sec:post-process}) without any forecast scaling. Tuning (§ \ref{sec:sft}, § \ref{sec:utune_df}) improves 7B-parameter models, and GPT-4 shows a bias against conflict compared to other models (§ \ref{sec:results}). The nuances that lead to conflicts are not necessarily obvious.