Multilingual Dyadic Interaction Corpus NoXi+J: Toward Understanding Asian-European Non-verbal Cultural Characteristics and their Influences on Engagement

Marius Funk, Shogo Okada, Elisabeth André

TL;DR

This work addresses how cross-cultural differences in non-verbal behavior influence engagement in dyadic conversations by introducing NoXi+J, a multilingual multimodal corpus that extends the NoXi dataset with Japanese and Chinese data. It systematically extracts and analyzes 94 non-verbal features across five languages to identify cultural differences in cues such as smiles, backchanneling, and head movement, and assesses their relationship to engagement. The authors train LSTM-based engagement predictors, evaluate cross-language transfer, and use SHAP analysis to reveal culture-specific feature importance, finding that transfer learning can notably improve cross-language predictions, especially for Japanese data. The study provides a publicly available, culturally informed benchmark for engagement prediction and highlights the need for culture-sensitive AI agents in real-world human-computer interaction.

Abstract

Non-verbal behavior is a central challenge in understanding the dynamics of a conversation and the affective states between interlocutors arising from the interaction. Although psychological research has demonstrated that non-verbal behaviors vary across cultures, limited computational analysis has been conducted to clarify these differences and assess their impact on engagement recognition. To gain a greater understanding of engagement and non-verbal behaviors among a wide range of cultures and language spheres, in this study we conduct a multilingual computational analysis of non-verbal features and investigate their role in engagement and engagement prediction. To achieve this goal, we first expanded the NoXi dataset, which contains interaction data from participants living in France, Germany, and the United Kingdom, by collecting session data of dyadic conversations in Japanese and Chinese, resulting in the enhanced dataset NoXi+J. Next, we extracted multimodal non-verbal features, including speech acoustics, facial expressions, backchanneling and gestures, via various pattern recognition techniques and algorithms. Then, we conducted a statistical analysis of listening behaviors and backchannel patterns to identify culturally dependent and independent features in each language and common features among multiple languages. These features were also correlated with the engagement shown by the interlocutors. Finally, we analyzed the influence of cultural differences in the input features of LSTM models trained to predict engagement for five language datasets. A SHAP analysis combined with transfer learning confirmed a considerable correlation between the importance of input features for a language set and the significant cultural characteristics analyzed.
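The abstract mentions correlating extracted non-verbal features with the engagement shown by the interlocutors, separately for each language. A minimal sketch of that kind of per-language correlation is given below, assuming a plain Pearson coefficient and a flat list of `(language, feature_value, engagement)` rows. The function names, the row layout, and the toy values are hypothetical illustrations; they are not taken from NoXi+J, and the abstract does not say which correlation measure the authors used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlations_by_language(rows):
    """Group (language, feature_value, engagement) rows, then correlate per language."""
    grouped = {}
    for language, feature_value, engagement in rows:
        xs, ys = grouped.setdefault(language, ([], []))
        xs.append(feature_value)
        ys.append(engagement)
    return {lang: pearson(xs, ys) for lang, (xs, ys) in grouped.items()}


# Synthetic toy rows for illustration only (assumption: NOT real corpus data).
# Each row: (language code, hypothetical feature value, hypothetical engagement score).
toy_rows = [
    ("JP", 0.1, 0.2), ("JP", 0.4, 0.5), ("JP", 0.8, 0.9),
    ("DE", 0.1, 0.9), ("DE", 0.4, 0.5), ("DE", 0.8, 0.2),
]

result = correlations_by_language(toy_rows)
for lang, r in sorted(result.items()):
    print(f"{lang}: r = {r:+.3f}")
```

With the toy rows, the sign of the coefficient differs between the two language groups, which is the kind of contrast a culture-dependent feature would show. A real analysis would also need significance testing and a choice of time window, neither of which is specified here.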

Paper Structure (40 sections, 6 figures, 7 tables)

This paper contains 40 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: NoXi recording. Expert (left) and novice (right).
  • Figure 2: Age distribution of the speakers of the 5 primary recorded languages: German (DE), French (FR), English (EN), Japanese (JP) and Chinese (ZH). NoXi
  • Figure 3: Schematic conversation depicting turn-taking, the division of the data by speaking state, engagement and instances of high positive and high negative engagement correlation. VBC describes vocal backchanneling instances.
  • Figure 4: Correlations between annotated novice engagement and a selection of relevant features.
  • Figure 5: Correlations between expert and novice engagement.
  • ...and 1 more figure