Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Livia Qian; Gabriel Skantze

TL;DR

This paper investigates whether short dialogue contexts and feedback responses can be embedded in the same representation space using a contrastive learning objective. The resulting model outperforms human participants on the same ranking task, and the learned embeddings carry information about the conversational function of feedback responses.

Abstract

Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influences their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans on the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
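
To make the contrastive objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE-style loss over a batch of matched context-feedback pairs, where the diagonal of the cosine similarity matrix holds the positive pairs and all other cells are negatives. The function names, the temperature value, and the toy two-dimensional embeddings are illustrative assumptions; the abstract does not specify the exact loss.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(contexts, feedbacks):
    """Cosine similarity between every context (rows) and feedback (columns)."""
    c = [l2_normalize(v) for v in contexts]
    f = [l2_normalize(v) for v in feedbacks]
    return [[sum(a * b for a, b in zip(ci, fj)) for fj in f] for ci in c]


def _cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(contexts, feedbacks, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not confirmed by the source).

    Pair i of contexts and feedbacks is the positive; every other
    combination in the batch is treated as a negative.
    """
    sim = similarity_matrix(contexts, feedbacks)
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Context -> feedback direction: each row should peak on the diagonal.
    c2f = sum(_cross_entropy(logits[i], i) for i in range(n)) / n
    # Feedback -> context direction: each column should peak on the diagonal.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    f2c = sum(_cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (c2f + f2c)


# Toy embeddings (hypothetical values, for illustration only).
toy_contexts = [[1.0, 0.0], [0.0, 1.0]]
aligned_feedbacks = [[1.0, 0.0], [0.0, 1.0]]
swapped_feedbacks = [[0.0, 1.0], [1.0, 0.0]]
```

With these toy inputs, `contrastive_loss(toy_contexts, aligned_feedbacks)` is close to zero, while the swapped pairing gives a much larger loss, which is the behavior a contrastive objective of this kind is meant to reward.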

Paper Structure (12 sections, 3 figures, 4 tables)

Introduction
Method
Dataset
Model setup
Metrics
Results
Model results
Human evaluation
Probing classifiers
Visualization of embedding space
Conclusion and Discussion
Acknowledgements

Figures (3)

Figure 1: Outline of our contrastive learning approach where the matrix represents the similarity scores between combinations of context-feedback pairs. The green and red boxes represent positive and negative pairs, respectively. The contexts and feedback responses can be present in audio and/or text format.
Figure 2: Scatter plot of the cosine similarity score (x-axis) assigned by our best audio-based model trained on Switchboard (HuBERT) and the participants' ratings (y-axis). For each data point, the similarity score and rating are based on context-feedback pairs with the same function label.
Figure 3: The t-SNE plot of learned audio-and-text embeddings where the feedback and context embeddings are concatenated. Abbreviations of functions are defined in the Dataset section.
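
The cosine similarity score plotted in Figure 2 can also serve as the ranking criterion for candidate feedback responses. The sketch below illustrates that idea under stated assumptions: the embeddings are taken as given (in the paper they would come from the trained encoders), and the function names, candidate labels, and vector values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_feedback(context_embedding, candidates):
    """Order candidate feedback responses by similarity to the context.

    `candidates` maps a label to its embedding; the most appropriate
    candidate, according to the score, comes first.
    """
    scored = [
        (label, cosine_similarity(context_embedding, embedding))
        for label, embedding in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, for illustration only.
context = [1.0, 0.2]
candidates = {
    "yeah": [0.9, 0.1],
    "oh no": [-0.5, 1.0],
    "mm-hm": [0.5, 0.5],
}
ranking = rank_feedback(context, candidates)
```

Here `ranking` lists the candidates from most to least similar to the context, so the same score works both as an appropriateness metric for a single pair and as a way to rank several candidates.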
