Modeling social interaction dynamics using temporal graph networks
J. Taery Kim, Archit Naik, Isuru Jayarathne, Sehoon Ha, Jouh Yeong Chew
TL;DR
This work addresses the challenge of modeling multiparty social interaction dynamics for human-robot collaboration by leveraging an adapted Temporal Graph Network (TGN) that fuses gaze, speech, and environmental context. The model is trained on a gaze-edge prediction task and evaluated on next gaze and next speaker prediction, demonstrating substantial improvements over a history-based baseline (+37.0% F1 for gaze and +29.0% F1 for speaker). A key contribution is a compact, efficient message-passing scheme that reduces message size from 768 to 14 elements, along with a two-phase learning strategy that enables both graph dynamics learning and downstream node prediction. The approach generalizes to varying group sizes and lays groundwork for downstream tasks such as human state inference and intention estimation, while remaining extensible to additional modalities and contexts.
Abstract
Integrating intelligent systems, such as robots, into dynamic group settings poses challenges due to the mutual influence of human behaviors and internal states. A robust representation of social interaction dynamics is essential for effective human-robot collaboration. Existing approaches often narrow their focus to facial expressions or speech, overlooking the broader context. We propose an adapted Temporal Graph Network that represents social interaction dynamics comprehensively while remaining practical to implement. Our method incorporates temporal multi-modal behavioral data, including gaze interaction, voice activity, and environmental context. The representation is trained as a link prediction problem using annotated gaze interaction data. On this task, the model outperforms the baseline by 37.0% in F1-score. The improvement carries over to a secondary task, next speaker prediction, where the model achieves a 29.0% improvement. Our contributions are two-fold. First, we present a model for representing social interaction dynamics that can be used for many downstream human-robot interaction tasks, such as human state inference and next speaker prediction. Second, and more importantly, we achieve this with a concise and efficient message-passing method that reduces the message size from 768 to 14 elements while still outperforming the baseline.
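To make the gaze link-prediction setup concrete, the following is a minimal sketch of how gaze interactions could be stored as timestamped edges and scored with F1. It is not the authors' implementation: the `GazeEvent` format, the toy data, and the "repeat the most recent target" predictor are all illustrative assumptions, and the abstract does not specify how the history-based baseline actually works.

```python
from collections import namedtuple

# Hypothetical event format (an assumption, not from the paper):
# at time t, person `src` looks at person `dst`.
GazeEvent = namedtuple("GazeEvent", ["t", "src", "dst"])


def predict_next_gaze(history, src):
    """History-based guess: repeat the most recent gaze target of `src`."""
    for event in reversed(history):
        if event.src == src:
            return event.dst
    return None  # no history for this person yet


def f1_score(predicted_edges, true_edges):
    """F1 over sets of directed (src, dst) gaze edges."""
    predicted, true = set(predicted_edges), set(true_edges)
    tp = len(predicted & true)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy interaction history for a three-person group, ordered by time.
events = [
    GazeEvent(0.0, "A", "B"),
    GazeEvent(0.5, "B", "A"),
    GazeEvent(1.0, "C", "A"),
    GazeEvent(1.5, "A", "C"),
]

# Toy ground truth for the next gaze edge of each person.
next_true = [("A", "C"), ("B", "C"), ("C", "A")]

people = ["A", "B", "C"]
predicted = [(p, predict_next_gaze(events, p)) for p in people]
score = f1_score(predicted, next_true)
```

In this toy case the predictor gets two of the three edges right. A learned temporal graph model would replace `predict_next_gaze` with a scorer over candidate edges, while the edge-level F1 evaluation could stay the same.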
