Exploiting temporal information to detect conversational groups in videos and predict the next speaker

Lucrezia Tosato, Victor Fortier, Isabelle Bloch, Catherine Pelachaud

TL;DR

The paper addresses two tasks in video of group conversations: detecting F-formations and predicting the next speaker, in both cases by exploiting temporal information and engagement signals. It introduces a time-weighted center of attention and memory-based clustering to identify conversational groups robustly, and an LSTM-based model that predicts the next speaker from a small set of labeled nonverbal activities. Evaluated on the MatchNMingle dataset, the approach achieves 85% true positives for F-formation detection and 98% accuracy for next-speaker prediction, highlighting the benefits of temporal modeling for social interaction analysis. The work advances automatic analysis of social dynamics, with practical implications for video understanding and human–computer interaction in crowded or dyadic settings.

Abstract

Studies in human-human interaction have introduced the concept of the F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach uses a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the next speaking turn in a conversation group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.

Paper Structure (9 sections, 2 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Pipeline for F-formation detection and analysis.
  • Figure 2: Center of attention calculated with $\psi$ (green dot) and with $\theta$ (pink dot), frame 17417, day one, camera one.
  • Figure 3: F-formations (orange circles), day one, camera one.
  • Figure 4: Number of F-formations per frame, with ground truth indicated by the yellow line.
  • Figure 5: Analysis of the dyad formed by persons 9 and 28.
  • ...and 1 more figure