Modeling social interaction dynamics using temporal graph networks

J. Taery Kim, Archit Naik, Isuru Jayarathne, Sehoon Ha, Jouh Yeong Chew

TL;DR

This work addresses the challenge of modeling multiparty social interaction dynamics for human-robot collaboration by leveraging an adapted Temporal Graph Network (TGN) that fuses gaze, speech, and environmental context. The model is trained on a gaze-edge prediction task and evaluated on next gaze and next speaker prediction, demonstrating substantial improvements over a history-based baseline (+37.0% F1 for gaze and +29.0% F1 for speaker). A key contribution is a compact, efficient message-passing scheme that reduces message size from 768 to 14 elements, along with a two-phase learning strategy that enables both graph dynamics learning and downstream node prediction. The approach generalizes to varying group sizes and lays groundwork for downstream tasks such as human state inference and intention estimation, while remaining extensible to additional modalities and contexts.

Abstract

Integrating intelligent systems, such as robots, into dynamic group settings poses challenges due to the mutual influence of human behaviors and internal states. A robust representation of social interaction dynamics is essential for effective human-robot collaboration. Existing approaches often narrow their focus to facial expressions or speech, overlooking the broader context. We propose an adapted Temporal Graph Network that represents social interaction dynamics comprehensively while remaining practical to implement. Our method incorporates temporal multi-modal behavioral data, including gaze interaction, voice activity, and environmental context. The representation is trained as a link prediction problem on annotated gaze interaction data. On this task, the model's F1-score exceeds that of the baseline model by 37.0%. The improvement carries over to a secondary task, next speaker prediction, where the gain is 29.0%. Our contributions are two-fold. First, we present a model for representing social interaction dynamics that can serve many downstream human-robot interaction tasks, such as human state inference and next speaker prediction. Second, and more importantly, this is achieved with a more concise yet efficient message passing method that reduces the message size from 768 to 14 elements while still outperforming the baseline model.
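To make the link-prediction framing concrete, the sketch below represents gaze interactions as timestamped directed edges and scores a simple history-based predictor with micro-averaged F1. Everything in it is an illustrative assumption: the `GazeEvent` fields, the "most frequent past target" rule, and the toy data are hypothetical, and this summary does not specify how the paper's baseline or its F1 computation are actually defined.

```python
from collections import Counter, defaultdict
from typing import NamedTuple, Optional


class GazeEvent(NamedTuple):
    """One temporal edge: person `src` looks at `dst` at time `t` (hypothetical schema)."""
    t: float
    src: int
    dst: int


def history_baseline(events: list[GazeEvent]) -> list[tuple[Optional[int], int]]:
    """For each event, predict the source's most frequent past gaze target.

    Returns (predicted, actual) pairs in time order. The prediction is None
    when the source has no history yet.
    """
    counts: dict[int, Counter] = defaultdict(Counter)
    pairs = []
    for ev in sorted(events, key=lambda e: e.t):
        past = counts[ev.src]
        pred = past.most_common(1)[0][0] if past else None
        pairs.append((pred, ev.dst))
        counts[ev.src][ev.dst] += 1  # update history only after predicting
    return pairs


def micro_f1(pairs: list[tuple[Optional[int], int]]) -> float:
    """Micro-averaged F1 where a None prediction counts as a miss, not a false positive."""
    tp = sum(1 for p, a in pairs if p == a)
    fp = sum(1 for p, a in pairs if p is not None and p != a)
    fn = sum(1 for p, a in pairs if p != a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy three-person interaction (made-up data, not from the paper).
demo = [
    GazeEvent(0.0, 0, 1),
    GazeEvent(1.0, 0, 1),
    GazeEvent(2.0, 1, 0),
    GazeEvent(3.0, 0, 2),
    GazeEvent(4.0, 1, 0),
]
pairs = history_baseline(demo)
score = micro_f1(pairs)
print(pairs, score)
```

A learned temporal graph model would replace `history_baseline` with a predictor that scores candidate `(src, dst)` edges from node memory and incoming messages; the evaluation loop over time-ordered events would stay the same.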

Paper Structure (19 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of social interaction dynamics modeling. (a) Two problems are formulated for this study: A, scene perception and representation using multi-modal inputs, and B, the application of A to downstream tasks such as next speaker prediction. (b) Social interaction dynamics are modeled as a temporal graph and learned with temporal graph neural networks. We tackle next gaze prediction as problem A during phase 1 and next speaker prediction as problem B during phase 2.
  • Figure 2: Preprocessing multiparty interaction data and modeling the interaction using a temporal graph network model.
  • Figure 3: Sensors worn by each subject during the experiment.
  • Figure 4: Comparison of different baselines on the speed vs. accuracy trade-off, with the model trained on S14.