Causal Discovery and Counterfactual Reasoning to Optimize Persuasive Dialogue Policies

Donghuo Zeng, Roberto Legaspi, Yuewen Sun, Xinshuai Dong, Kazushi Ikeda, Peter Spirtes, Kun Zhang

TL;DR

This work addresses adapting persuasive dialogue systems to dynamic user states by integrating causal discovery and counterfactual reasoning into reinforcement learning. The pipeline identifies causal relations between persuadee and persuader strategies using GRaSP, generates plausible counterfactual utterances with BiCoGAN guided by these relations, and learns optimal dialogue policies via Dueling Double Deep Q-Networks trained on counterfactual data. Empirical results on the PersuasionForGood dataset show that the causality-guided counterfactual approach yields higher Q-values and greater predicted donations than baselines, demonstrating the value of causal structure in generating effective counterfactuals and informing policy. The study highlights the potential of combining causal inference with counterfactual data to dynamically tailor persuasive interactions, with implications for marketing, health communication, and social-good initiatives, while noting the need for real-user studies and broader personalization in future work.

Abstract

Tailoring persuasive conversations to users leads to more effective persuasion. However, existing dialogue systems often struggle to adapt to dynamically evolving user states. This paper presents a novel method that leverages causal discovery and counterfactual reasoning for optimizing system persuasion capability and outcomes. We employ the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors influencing system responses, which inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances for the system. Subsequently, we train a Dueling Double Deep Q-Network (D3QN) model on the counterfactual data to determine the best policy for selecting system utterances. Our experiments with the PersuasionForGood dataset show measurable improvements in persuasion outcomes using our approach over baseline methods. The observed increase in cumulative rewards and Q-values highlights the effectiveness of causal discovery in enhancing counterfactual reasoning and optimizing reinforcement learning policies for online dialogue systems.
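
The abstract names D3QN but does not spell out its update rule. As a rough illustration only, the sketch below shows the two standard ingredients behind a Dueling Double Deep Q-Network: the dueling aggregation of a state value and per-action advantages, and the Double DQN target, in which the online network chooses the next action and the target network scores it. The function names, the discount factor `gamma`, and the use of plain Python lists are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


def double_dqn_target(
    reward: float,
    done: bool,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    gamma: float = 0.99,  # hypothetical discount factor, not from the paper
) -> float:
    """Double DQN target for one transition.

    The online network's Q-values select the next action; the target
    network's Q-values evaluate that action.
    """
    if done:
        # Terminal transition: no bootstrapped future value.
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


if __name__ == "__main__":
    # Toy numbers: one state value and two candidate persuader actions.
    print(dueling_q(1.0, [1.0, 3.0]))
    print(double_dqn_target(1.0, False, [0.2, 0.9], [5.0, 2.0], gamma=0.5))
```

Decoupling action selection from action evaluation is what distinguishes the Double DQN target from the ordinary max-based target, and it is commonly used to reduce overestimated Q-values.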

Paper Structure

This paper contains 13 sections, 9 equations, 8 figures.

Figures (8)

  • Figure 1: An example of a persuasive dialogue between the persuader (ER) and persuadee (EE), extracted from the PersuasionForGood dataset (ID:20180826-181951_904_live, donation: $0.05). Each utterance of both ER and EE has been annotated in the dataset with a strategy. By identifying the causal relationship between EE's negative-reaction-to-donation strategy and ER's logical-appeal strategy, the persuader can replace the ground-truth foot-in-the-door strategy in the PersuasionForGood dataset with the more effective logical-appeal strategy. If counterfactual reasoning (CL) is applied, BiCoGAN generates the next action with the logical-appeal strategy; otherwise, the next action is selected from ground-truth utterances that use the logical-appeal strategy. The resulting donations are: ground truth ($0.05), causality without CL ($0.10), and causality with CL ($1.00).
  • Figure 2: Illustration of our proposed architecture for optimizing persuasive dialogues. Each dialogue utterance is represented using BERT embeddings, while two fine-tuned GPT-2 models, one for the persuadee and one for the persuader, predict the dialogue strategy associated with each utterance. The GRaSP method is employed for causal discovery to identify cause-effect relationships between the persuadee's (EE) and the persuader's (ER) strategies, which are influenced by the persuadee’s donation behavior. Specifically, we identify the causal relationship $x_{ee} \rightarrow x_{er}$, which is then incorporated into a retrieval-based model. This model selects the counterfactual action $a'_t$ by retrieving the most probable persuader utterance based on a similarity score with the current state $s_t$. To validate our hypothesis that persuasive dialogues adhere to an underlying structural causal mechanism, we utilize BiCoGAN to generate counterfactual data $\tilde{D}$. The counterfactual dataset is subsequently used to train a Dueling Double Deep Q-Network (D3QN), which learns an optimized policy aimed at maximizing Q-values. This process ultimately facilitates the generation of new dialogue strategies that have a higher likelihood of increasing donation amounts.
  • Figure 3: Illustration of the state transition dynamics in a persuasive dialogue. The figure depicts how the persuadee’s state $s_t$ evolves based on the persuader’s action $a_t$, following a structured causal process.
  • Figure 4: (a) The BiCoGAN training process, which consists of a generator ($G$), an encoder ($E$), and a discriminator ($D$). The generator learns to create realistic counterfactual samples by mapping latent representations to persuasive dialogue states, while the encoder reconstructs latent variables from generated samples, and the discriminator distinguishes between real and generated data. (b) Post-training, the fine-tuned GPT-2 model and a retrieval-based model are used for counterfactual action generation. A causal graph connects these models to determine the optimal alternative action $a^{'}_{t}$ for persuasion. The trained generator $G_T$ is then employed to construct the counterfactual next state $s^{'}_{t+1}$, enabling counterfactual reasoning for improved dialogue strategies.
  • Figure 5: The process of generating counterfactual dialogue sequences to improve reward prediction in policy learning. Starting from the initial state $s_0^{'}$, a sequence of action-state pairs ($a^{'}_{i-1}$, $s^{'}_i$) is generated. For each state $s_t^{'}$, the next action $a^{'}_i$ is selected by maximizing the Q-value: $k = \arg \max(Q(s_t^{'}, a_t^{'}))$ for $t = [0..N]$, where $N = 1$. Actions are drawn from the $t$-th dialogue in the counterfactual dataset $\tilde{D}_k$. The dialogue begins with the first several utterances from the ground truth dialogue and continues until the maximum length of 25 utterances is reached.
  • ...and 3 more figures
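
The Figure 2 caption describes a retrieval-based model that picks the counterfactual action by similarity to the current state, and the Figure 5 caption describes choosing among candidates by maximizing the Q-value. The following is a minimal sketch of both steps under stated assumptions: cosine similarity stands in for the unspecified similarity score, the causal edge from persuadee strategy to persuader strategy is represented as a set of allowed persuader strategies, and every name here (`Candidate`, `retrieve_counterfactual_action`, `pick_by_q`, the toy embeddings) is hypothetical rather than taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

# (utterance id, persuader strategy label, utterance embedding)
Candidate = Tuple[str, str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_counterfactual_action(
    state: Sequence[float],
    candidates: List[Candidate],
    allowed_strategies: Set[str],
) -> str:
    """Return the id of the most similar candidate whose strategy is allowed.

    `allowed_strategies` is an assumed stand-in for the persuader strategies
    that the discovered causal graph links to the persuadee's current strategy.
    """
    pool = [c for c in candidates if c[1] in allowed_strategies]
    if not pool:
        raise ValueError("no candidate matches the allowed strategies")
    return max(pool, key=lambda c: cosine(state, c[2]))[0]


def pick_by_q(
    state: Sequence[float],
    actions: List[Sequence[float]],
    q_fn: Callable[[Sequence[float], Sequence[float]], float],
) -> int:
    """Return the index k of the action with the highest Q(state, action)."""
    return max(range(len(actions)), key=lambda k: q_fn(state, actions[k]))


if __name__ == "__main__":
    state = [1.0, 0.0]  # toy 2-d embedding of the persuadee's utterance
    candidates = [
        ("u1", "logical-appeal", [0.9, 0.1]),
        ("u2", "foot-in-the-door", [1.0, 0.0]),
        ("u3", "logical-appeal", [0.0, 1.0]),
    ]
    print(retrieve_counterfactual_action(state, candidates, {"logical-appeal"}))
```

In this toy run, `u2` is the closest utterance overall but is filtered out because its strategy is not in the allowed set, so the retrieval falls back to the closest logical-appeal utterance. That mirrors the Figure 1 example, where the foot-in-the-door response is replaced by a logical-appeal one.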