Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion Outcome
Donghuo Zeng, Roberto S. Legaspi, Yuewen Sun, Xinshuai Dong, Kazushi Ikeda, Peter Spirtes, Kun Zhang
TL;DR
The paper tackles dynamic persuasion by modeling user-specific latent personality dimensions (LPDs) and using counterfactual reasoning to optimize dialogue. It introduces a Dialogue-based Personality Prediction Regression (DPPR) model to estimate time-varying LPDs, uses BiCoGAN to generate counterfactual dialogue states conditioned on those LPDs, and applies D3QN to learn utterance-selection policies from the counterfactual data. Empirical results on PersuasionForGood show that combining DPPR with BiCoGAN yields higher cumulative rewards and Q-values than both BiCoGAN alone and the ground-truth baseline, demonstrating the value of personalized, counterfactual-aware policy learning in online interactions. The approach moves practical dialogue systems toward adapting dynamically to evolving user traits and exploring more widely through counterfactual scenarios, which could make persuasion technologies more effective in real-world settings.
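To make the counterfactual step more concrete, here is a minimal sketch of the abduction, intervention, prediction pattern that a BiCoGAN-style generator and encoder pair makes possible. Everything in it is an assumption for illustration, not the authors' implementation: a fixed linear map stands in for the learned generator, the functions one_hot, generate, encode_noise, and counterfactual_state and all dimensions are hypothetical, and we assume the intervention is on the system action while the user's predicted LPDs are held fixed as the condition.

```python
import numpy as np

# Hypothetical toy stand-in for BiCoGAN: a fixed linear "generator"
#   state = W_c @ lpd + W_a @ one_hot(action) + z
# where lpd is the user's latent personality vector (the condition),
# action is a system utterance id, and z is exogenous noise.
# In a real BiCoGAN the generator and the encoder are learned networks.
rng = np.random.default_rng(0)
STATE_DIM, LPD_DIM, N_ACTIONS = 4, 5, 3
W_c = rng.normal(size=(STATE_DIM, LPD_DIM))
W_a = rng.normal(size=(STATE_DIM, N_ACTIONS))

def one_hot(action):
    return np.eye(N_ACTIONS)[action]

def generate(lpd, action, z):
    """Generator: (LPD condition, action, noise) -> dialogue state."""
    return W_c @ lpd + W_a @ one_hot(action) + z

def encode_noise(state, lpd, action):
    """Encoder / abduction: recover the noise that explains an observed state."""
    return state - W_c @ lpd - W_a @ one_hot(action)

def counterfactual_state(state, lpd, action, cf_action):
    """Abduction -> intervention -> prediction, with the user's LPDs held fixed."""
    z = encode_noise(state, lpd, action)
    return generate(lpd, cf_action, z)

# Usage: hypothetical LPDs (as a DPPR-like predictor might output) and one observed turn.
lpd = rng.uniform(size=LPD_DIM)
z_true = rng.normal(size=STATE_DIM)
observed = generate(lpd, action=0, z=z_true)
cf = counterfactual_state(observed, lpd, action=0, cf_action=2)
```

Because the inferred noise is reused, the counterfactual state differs from the observed one only through the changed action, which is what lets a policy learner explore utterances the user never actually received.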
Abstract
Tailoring persuasive conversations to individual users yields better persuasion outcomes. However, existing persuasive conversation systems rely on persuasive strategies and struggle to adjust dialogues dynamically to the evolving states of individual users during an interaction. This limitation restricts a system's ability to conduct flexible, dynamic conversations and leads to suboptimal persuasion outcomes. In this paper, we present a novel approach that tracks a user's latent personality dimensions (LPDs) during an ongoing persuasion conversation and generates tailored counterfactual utterances based on these LPDs to optimize the overall persuasion outcome. Specifically, our method combines a Bidirectional Conditional Generative Adversarial Network (BiCoGAN) with a Dialogue-based Personality Prediction Regression (DPPR) model to generate counterfactual data. This enables the system to formulate alternative persuasive utterances that are better suited to the user. We then use a Dueling Double Deep Q-Network (D3QN) to learn policies for selecting system utterances from the counterfactual data. Experimental results on the PersuasionForGood dataset show that our approach outperforms the existing method, BiCoGAN. The cumulative rewards and Q-values produced by our method also surpass the ground-truth benchmarks, showing the efficacy of using counterfactual reasoning and LPDs to optimize reinforcement learning policies in online interactions.
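The policy-learning step names D3QN, which combines a dueling Q-network head with the Double DQN target. The sketch below shows only those two computations in NumPy, assuming the network outputs (state values and advantages) are already available. The function names, batch shapes, discount factor, and numbers are hypothetical; the abstract does not describe the authors' network architecture or training details.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    return value[:, None] + advantages - advantages.mean(axis=1, keepdims=True)

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Double DQN target: the online net selects a*, the target net evaluates it."""
    best_actions = q_next_online.argmax(axis=1)
    q_eval = q_next_target[np.arange(len(rewards)), best_actions]
    # Terminal transitions (done = 1) do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * q_eval

# Usage on a hypothetical batch of two next states with three candidate utterances.
q_online = dueling_q(np.array([1.0, 0.5]),
                     np.array([[0.2, -0.1, 0.5], [0.0, 0.3, -0.3]]))
q_target = dueling_q(np.array([0.8, 0.4]),
                     np.array([[0.0, 0.1, -0.1], [0.3, 0.0, -0.3]]))
targets = double_dqn_targets(rewards=np.array([1.0, 0.0]),
                             dones=np.array([0.0, 1.0]),
                             q_next_online=q_online,
                             q_next_target=q_target,
                             gamma=0.9)
```

In this example the online network selects action 2 for the first transition while the target network on its own would prefer action 1; evaluating the online network's choice with the target network is how Double DQN reduces the overestimation bias of standard DQN.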
