Learning to Generate Context-Sensitive Backchannel Smiles for Embodied AI Agents with Applications in Mental Health Dialogues

Maneesh Bilalpur, Mert Inan, Dorsa Zeinali, Jeffrey F. Cohn, Malihe Alikhani

TL;DR

The paper addresses the scarcity of mental health resources by developing embodied AI agents that produce context-sensitive backchannel smiles. It takes a data-driven approach, analyzing speaker and listener cues (prosody, linguistic features, and demographics) as predictors of smile intensity and duration, and it presents an attention-based generative framework conditioned on the significant predictors to synthesize facial landmarks. The authors report that incorporating listener behavior and a conditioning vector improves objective landmark-generation metrics, and that in a user study a Furhat agent with the generated smiles is perceived as more human-like. This work connects landmark-based facial motion generation with physical embodiment, with the aim of supporting digital mental health interventions and rapport-building between humans and agents.

Abstract

Addressing the critical shortage of mental health resources for effective screening, diagnosis, and treatment remains a significant challenge. This scarcity underscores the need for innovative solutions, particularly for enhancing the accessibility and efficacy of therapeutic support. Embodied agents with advanced interactive capabilities are emerging as a promising and cost-effective supplement to traditional caregiving methods. Crucial to these agents' effectiveness is their ability to simulate non-verbal behaviors, such as backchannels, that are pivotal in establishing rapport and understanding in therapeutic contexts but remain under-explored. To improve the rapport-building capabilities of embodied agents, we annotated backchannel smiles in videos of intimate face-to-face conversations on topics such as mental health, illness, and relationships. We hypothesized that both speaker and listener behaviors affect the duration and intensity of backchannel smiles. Cues from speech prosody and language, along with the demographics of the speaker and listener, were found to contain significant predictors of the intensity of backchannel smiles. Based on our findings, we introduce backchannel smile production in embodied agents as a generation problem. Results from our attention-based generative model suggest that listener information improves performance over the baseline speaker-centric generation approach. Conditioned generation using the significant predictors of smile intensity provides statistically significant improvements in empirical measures of generation quality. Our user study, in which generated smiles were transferred to an embodied agent, suggests that an agent with backchannel smiles is perceived as more human-like and, for non-personal conversations, is an attractive alternative to an agent without backchannel smiles.

Paper Structure

This paper contains 24 sections, 3 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of steps for backchannel smile generation in an embodied agent in a human-agent interaction: Speaker and listener (agent) turns are used to generate the listener's response facial expression as landmarks. The landmarks are then integrated with the embodied agent and added to the conversation flow represented as a dotted arrow.
  • Figure 2: Distribution of speaker and listener sex across different interpersonal relationships in the annotated RealTalk dataset. Relationships are color-coded: siblings (pink), friends (orange), paternal (green), and romantic couple (grey).
  • Figure 3: Regression slopes showing the effect of context cues on the intensity of BC smiles. A positive slope indicates the smile intensity increases with a given feature (vice-versa for a negative slope). * indicates slope is significant at $p < 0.05$ and $\boldsymbol{\cdot}$ indicates marginal significance at $p < 0.1$.
  • Figure 4: Architecture of a generative model incorporating the significant predictors (conditioning vector) for backchannel smiles. The encoder input contains speech embeddings of the listener and speaker from the pretrained VGGish model. The encoder's final hidden state is concatenated with the conditioning vector and then used to initialize the decoder's hidden state. Decoder output landmarks are fed back sequentially (dotted curves) to generate the next landmarks in the output sequence.
  • Figure 5: Effect of duration and intensity of smile along with ablation of inputs on generative model performance measured using APE (top) and PCK (bottom). Inputs to the model: S & C: speaker and conditioning vector; S & L: speaker and listener; S, L & C: speaker, listener, and conditioning vector. '$\boldsymbol{\cdot}$', '*' and '***' indicate significance with $p < 0.1$, $p < 0.05$ and $p < 0.001$ respectively.
  • ...and 4 more figures
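
The data flow described in the Figure 4 caption can be illustrated with a toy sketch. This is not the authors' implementation: it uses a plain tanh recurrent cell with random, untrained weights, omits the attention mechanism mentioned in the abstract, and assumes a linear "bridge" layer to map the concatenated encoder state and conditioning vector to the decoder's hidden size. The class name `ConditionedSeq2Seq`, all dimensions, and the zero start frame are hypothetical choices made only for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random (rows x cols) weight matrix as nested lists; untrained."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_step(W_in, W_h, x, h):
    """One tanh recurrent step: h' = tanh(W_in x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_in, x), matvec(W_h, h))]


class ConditionedSeq2Seq:
    """Hypothetical encoder-decoder with a conditioned decoder initial state."""

    def __init__(self, audio_dim, cond_dim, hidden_dim, landmark_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Encoder reads speaker and listener embeddings concatenated per frame.
        self.enc_in = make_matrix(hidden_dim, 2 * audio_dim, rng)
        self.enc_h = make_matrix(hidden_dim, hidden_dim, rng)
        # Assumed bridge: [encoder final state ; conditioning vector] -> decoder state.
        self.bridge = make_matrix(hidden_dim, hidden_dim + cond_dim, rng)
        self.dec_in = make_matrix(hidden_dim, landmark_dim, rng)
        self.dec_h = make_matrix(hidden_dim, hidden_dim, rng)
        self.out = make_matrix(landmark_dim, hidden_dim, rng)

    def encode(self, speaker_frames, listener_frames):
        """Run the encoder over time and return its final hidden state."""
        h = [0.0] * self.hidden_dim
        for s, l in zip(speaker_frames, listener_frames):
            h = rnn_step(self.enc_in, self.enc_h, s + l, h)
        return h

    def generate(self, speaker_frames, listener_frames, cond, start_landmarks, steps):
        """Autoregressively decode `steps` landmark frames."""
        h_enc = self.encode(speaker_frames, listener_frames)
        # Concatenate the final encoder state with the conditioning vector
        # and use the result to initialize the decoder's hidden state.
        h = [math.tanh(v) for v in matvec(self.bridge, h_enc + cond)]
        prev, outputs = start_landmarks, []
        for _ in range(steps):
            h = rnn_step(self.dec_in, self.dec_h, prev, h)
            prev = matvec(self.out, h)  # predicted landmarks, fed back next step
            outputs.append(prev)
        return outputs


# Toy usage with made-up sizes (real speech embeddings would be much larger).
rng = random.Random(1)
speaker = [[rng.random() for _ in range(8)] for _ in range(5)]
listener = [[rng.random() for _ in range(8)] for _ in range(5)]
model = ConditionedSeq2Seq(audio_dim=8, cond_dim=3, hidden_dim=16, landmark_dim=4)
frames = model.generate(speaker, listener, cond=[1.0, 0.0, 0.5],
                        start_landmarks=[0.0] * 4, steps=6)
```

Because the conditioning vector enters only through the decoder's initial state, changing `cond` while keeping the audio fixed changes every generated frame. That is the property the caption's conditioning step relies on; a trained model would learn these weights rather than sample them.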