Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

Jing Ye; Xinpei Zhao; Lu Xiang; Yaping Zhang; Chengqing Zong

Abstract

While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
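As a point of reference for the scalar half of the hybrid objective, the outline below lists Group Relative Policy Optimization among the preliminaries. The following is a minimal sketch of the group-relative advantage normalization commonly used in GRPO-style training, not the authors' implementation. The function name, the `eps` constant, the reward values, and the use of the population standard deviation are all assumptions made for illustration; how RAPO combines these advantages with verbal feedback distillation is not specified in this excerpt.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each scalar reward against its own sampling group.

    Assumption: `rewards` holds one scalar score per candidate response
    sampled for the same dialogue context. Implementations differ on
    population vs. sample standard deviation; population is used here.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four candidate replies to one dialogue context
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so the policy update depends only on relative quality within the group.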

Paper Structure (69 sections, 23 equations, 4 figures, 9 tables)

Introduction
Preliminaries
Group Relative Policy Optimization
On-Policy Distillation
Method
Reaction-Aware Problem Formulation
Hindsight Dialogue Selection
Generative Hindsight Feedback
Group-wise Exploration and User Simulation.
Contrastive Critique Generation.
Scalar–Verbal Hybrid Policy Optimization
Scalar RL
Verbal RL
Self-Teacher Construction.
Optimization.
...and 54 more sections

Figures (4)

Figure 1: Comparison between expert-centric scalar rewards and user-reaction aware mixed rewards.
Figure 2: Overview of the RAPO framework. It integrates user simulation, contrastive critique generation, and scalar-verbal hybrid policy optimization.
Figure 3: Results of the pairwise human evaluation. We randomly sampled 50 simulated dialogue instances for each task. For each instance, annotators were presented with the dialogue context and two candidate responses (A and B) generated by different models. The three bar segments indicate 'A win', 'tie', and 'B win', respectively.
Figure 4: Training dynamics on ESConv. Left: Actor policy entropy during training. Middle: Average critic reward score. Right: Mean response length generated by the policy.
