Exploiting temporal information to detect conversational groups in videos and predict the next speaker
Lucrezia Tosato, Victor Fortier, Isabelle Bloch, Catherine Pelachaud
TL;DR
The paper tackles detecting F-formations and predicting the next speaker in group conversations from video by exploiting temporal information and engagement signals. It introduces a time-weighted center-of-attention and memory-based clustering to robustly identify formation groups, and an LSTM-based model to predict the next speaker from a small set of labeled nonverbal activities. Evaluated on the MatchNMingle dataset, the approach achieves 85% true positives for F-formation detection and 98% accuracy for next-speaker prediction, highlighting the benefits of temporal modeling for social interaction analysis. The work advances automatic analysis of social dynamics with practical implications for video understanding and human–computer interaction in crowded or dyadic settings.
Abstract
Studies in human-human interaction have introduced the concept of F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach makes use of a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the next speaking turn in a conversational group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
