Reward-Driven Interaction: Enhancing Proactive Dialogue Agents through User Satisfaction Prediction

Wei Shen, Xiaonan He, Chuheng Zhang, Xuyun Zhang, Xiaolong Xu, Wanchun Dou

TL;DR

This paper tackles reward-driven proactive dialogue by addressing noisy reward supervision and long-tail feedback sparsity in industrial systems. It introduces two auxiliary tasks—contrastive self-supervised learning for rare ASR-induced utterances and domain-intent classification for long-tailed domains—integrated via multi-task learning to enhance user satisfaction prediction. The approach is implemented on a transformer-based proactive interaction mechanism within DuerOS and evaluated both offline on a large industrial dataset and online via A/B testing, achieving significant improvements in error recognition and contextual user satisfaction. The work provides a practical, deployable enhancement to proactive dialogue systems that improves robustness to ASR errors and domain skew, with direct implications for real-world user experience.

Abstract

Reward-driven proactive dialogue agents require a precise estimate of user satisfaction as an intrinsic reward signal to determine optimal interaction strategies. Specifically, this framework triggers clarification questions when it detects potential user dissatisfaction during interactions in an industrial dialogue system. Prior work typically trains a neural network on weak labels generated by a simple model that is itself trained on user actions after the current turn. However, existing methods suffer from two critical limitations in real-world scenarios: (1) Noisy Reward Supervision: dependence on weak labels derived from post-hoc user actions introduces bias and, in particular, fails to capture satisfaction signals in ASR-error-induced utterances; (2) Long-Tail Feedback Sparsity: the power-law distribution of user queries causes reward prediction accuracy to drop in low-frequency domains. Together, the noise in the weak labels and the power-law distribution of user utterances make it hard for the model to learn good representations of user utterances and sessions. To address these limitations, we propose two auxiliary tasks that improve the representation learning of user utterances and sessions and thereby enhance user satisfaction prediction. The first is a contrastive self-supervised learning task, which helps the model learn representations of rare user utterances and identify ASR errors. The second is a domain-intent classification task, which helps the model learn representations of user sessions from long-tailed domains and improves its performance on such domains. The proposed method is evaluated on DuerOS, demonstrating significant improvements in the accuracy of error recognition on rare user utterances and long-tailed domains.
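
The contrastive auxiliary task can be illustrated with a small sketch. It follows the setup given in the Figure 3 caption (each query is encoded twice under dropout to form a positive pair, and the other queries in the mini-batch act as negatives), but everything else is an assumption made for illustration: the encoder is omitted and random vectors stand in for query embeddings, similarity is taken to be cosine, the temperature of 0.05 is a placeholder, and the function names `dropout`, `cosine`, and `info_nce` are hypothetical. The paper's actual loss may differ.

```python
import math
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vec]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def info_nce(view_a, view_b, temperature=0.05):
    """Mean cross-entropy of choosing view_b[i] as the match for view_a[i].

    All other rows of view_b are treated as in-batch negatives.
    The temperature value is an assumed placeholder, not from the paper.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

rng = random.Random(0)
# Stand-in query embeddings (assumption: a real system would use the encoder output).
batch = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
view_a = [dropout(q, 0.1, rng) for q in batch]  # first pass with dropout
view_b = [dropout(q, 0.1, rng) for q in batch]  # second pass, different dropout mask
loss = info_nce(view_a, view_b)
print(f"contrastive loss on the toy batch: {loss:.4f}")
```

Because the two views of a query differ only in their dropout masks, minimizing this loss pushes the representation of a query to be stable under small perturbations while staying distinguishable from other queries in the batch.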

Paper Structure

This paper contains 23 sections, 17 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An overview of our model. The model has three parts: 1) The left part, the ASR and query match module, models the match between the user's current voice input and the ASR-decoded texts, which comprise the original query, the n-best queries from the ASR module, and the final query from the rewrite module. 2) The middle part, the query and reply match module, models the match between user utterances and system results, which comprise the final query, NLU results, and title info from the IR module. 3) The right part, the user session match module, models the match among the user's recent turns, which contain multi-turn user utterances and system information (system response, the interval between two turns, and the system NLU results).
  • Figure 2: Analysis of the online error cases. Among the 230 unrecognized user-dissatisfaction cases, ASR errors are the main error type, accounting for 48% of the total 220 cases. Furthermore, some long-tailed domains, e.g., Universal QA, account for a share of undiscovered errors similar to that of the main domain, i.e., the media domain.
  • Figure 3: Two auxiliary tasks that enhance the representation learning of user utterances and user sessions. The left part is a contrastive self-supervised learning task: we pass the same user queries through the embedding layer twice and apply standard dropout to the duplicated query vectors before the multi-head attention layers. This yields two different embeddings of each query, which form a positive pair; the other queries in the same mini-batch serve as negatives, and the auxiliary task is to identify the positive among the negatives. The right part is the domain-intent classification task: we first extract the representation of the user session, then pass the vector of the SEP symbol in the current turn to a fully connected layer, and train the classification task with cross-entropy loss.
  • Figure 4: Online analysis of the performance of TBM-2 and ABM. We compare the two methods across domains and error types. Specifically, we select the user-dissatisfaction sessions from 1000 online user sessions and divide them by domain and error type; 310 of the 1000 sessions are user-dissatisfaction sessions. ABM and TBM-2 have similar error-recognition accuracy (86.5% and 85%). However, ABM recalls 38 of the 119 ASR errors and 10 of the 61 NLU errors, more than TBM-2, which recalls 30 ASR errors and 5 NLU errors. In addition, ABM recalls 15 user-dissatisfied sessions in the Universal QA domain, significantly more than the 7 sessions recalled by TBM-2.
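
The Figure 4 caption reports raw counts, so it can help to turn them into per-error-type recall rates. The short sketch below does only that arithmetic. The counts are copied from the caption; the variable and function names are hypothetical, and no total is given for the Universal QA domain, so no rate is computed for it.

```python
# Counts copied from the Figure 4 caption.
totals = {"ASR": 119, "NLU": 61}
recalled = {
    "ABM": {"ASR": 38, "NLU": 10},
    "TBM-2": {"ASR": 30, "NLU": 5},
}

def recall(hit, total):
    """Fraction of the errors of one type that a model recalls."""
    return hit / total

for model, hits in recalled.items():
    for err, hit in hits.items():
        rate = recall(hit, totals[err])
        print(f"{model} {err}: {hit}/{totals[err]} = {rate:.1%}")
# ABM ASR: 38/119 = 31.9%
# ABM NLU: 10/61 = 16.4%
# TBM-2 ASR: 30/119 = 25.2%
# TBM-2 NLU: 5/61 = 8.2%
```

Read this way, the two models have similar overall accuracy, but ABM recalls a noticeably larger fraction of both ASR and NLU errors.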