Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

Joey Hong, Jessica Lin, Anca Dragan, Sergey Levine

TL;DR

The key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds.

Abstract

Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a single effective response. In reality, many dialogues are interactive: an agent's utterances influence its conversational partner, elicit information, or change the partner's opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks rely on curating some amount of expert data. However, doing so often requires understanding the underlying cognitive processes of the conversational partner, a skill that neither humans nor LLMs trained on human data reliably possess. Our key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental state, intelligent interaction, and persuasion: mental health support and soliciting charitable donations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overall scheme for hindsight regenerations, our proposed approach for augmenting data to train language agents via RL.
  • Figure 2: Overview of our approach. We relabel suboptimal actions in the original dataset, then generate plausible completions of the dialogue after relabeling to obtain hindsight regenerations. Then, these regenerations are aggregated with the original data to be used for downstream offline RL.
  • Figure 3: Example dialogues by each agent trying to persuade a skeptical user to donate. Prompted agents often concede too easily. On the other hand, ablations such as Zero-shot RL and RFT are unable to adapt persuasive strategies when emotional appeals are shown to be ineffective. Only our proposed Hindsight RL agent tries to address the user's skepticism in a polite and intelligent manner.
  • Figure 4: Example dialogues by our Hindsight RL agent showing that it can adapt its strategy (between emotional and logical appeals) based on the user's perceived cognitive state.
  • Figure 5: Example dialogues by each agent counseling a user stressed about work. Note that only our proposed Hindsight RL agent makes helpful suggestions for actions the user can take.
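
The data-augmentation loop summarized in the Figure 2 caption (relabel a suboptimal action, generate a plausible completion, aggregate with the original data) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Dialogue` type and the `pick_turn`, `relabel`, `complete`, and `score` callables are hypothetical stand-ins for what would in practice be LLM calls and a task-specific reward labeler, and the toy dialogue is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance); speaker is "agent" or "user"


@dataclass
class Dialogue:
    turns: List[Turn]
    reward: float


def hindsight_regenerate(
    dialogue: Dialogue,
    pick_turn: Callable[[Dialogue], int],
    relabel: Callable[[Dialogue, int], str],
    complete: Callable[[List[Turn]], List[Turn]],
    score: Callable[[List[Turn]], float],
) -> Dialogue:
    """Build one regenerated dialogue from one original dialogue."""
    t = pick_turn(dialogue)               # index of a suboptimal agent turn
    prefix = dialogue.turns[:t]           # history before that turn is kept
    new_action = relabel(dialogue, t)     # hindsight: sees the full dialogue
    branch = prefix + [("agent", new_action)]
    turns = branch + complete(branch)     # plausible continuation of the branch
    return Dialogue(turns, score(turns))


def build_offline_dataset(dialogues: List[Dialogue], **fns) -> List[Dialogue]:
    """Aggregate the original dialogues with their regenerations."""
    return list(dialogues) + [hindsight_regenerate(d, **fns) for d in dialogues]


# Toy stand-ins (assumptions for illustration only; real versions would call an LLM).
def pick_turn(d: Dialogue) -> int:
    return max(i for i, (speaker, _) in enumerate(d.turns) if speaker == "agent")


def relabel(d: Dialogue, t: int) -> str:
    return "I can share how donations are spent, if that would help."


def complete(turns: List[Turn]) -> List[Turn]:
    return [("user", "That helps. I will donate.")]


def score(turns: List[Turn]) -> float:
    speaker, text = turns[-1]
    return 1.0 if speaker == "user" and "donate" in text.lower() else 0.0


original = Dialogue(
    turns=[
        ("agent", "Would you like to donate?"),
        ("user", "I am not sure the money is used well."),
        ("agent", "No problem, have a nice day."),
        ("user", "Bye."),
    ],
    reward=0.0,
)

dataset = build_offline_dataset(
    [original], pick_turn=pick_turn, relabel=relabel, complete=complete, score=score
)
```

The resulting `dataset` holds the unaltered dialogue alongside its regeneration, which shares the original prefix up to the relabeled turn. A downstream offline RL method would then train on both.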