Co-Speech Gesture Detection through Multi-Phase Sequence Labeling

Esam Ghaleb; Ilya Burenko; Marlou Rasenberg; Wim Pouw; Peter Uhrig; Judith Holler; Ivan Toni; Aslı Özyürek; Raquel Fernández

TL;DR

The paper tackles automatic gesture detection in naturalistic dialogue by reframing the task as multi-phase sequence labeling rather than binary detection. It introduces a framework that embeds sequences of skeletal movements with ST-GCNs, encodes them with Transformer layers, and applies a CRF for structured prediction over gesture phases $\{P,S,R,N\}$. On a dataset of 38 speakers and 16 hours of co-speech gestures, the method outperforms binary and classification baselines, particularly in stroke detection and gesture-unit detection, and reveals insights into phase boundaries and latent structure. The results demonstrate that modeling the sequential dynamics and phase transitions yields more accurate and robust gesture detection, with practical implications for human-computer interaction and social-behavior analysis.

Abstract

Gestures are integral components of face-to-face communication. They unfold over time, often following predictable movement phases of preparation, stroke, and retraction. Yet, the prevalent approach to automatic gesture detection treats the problem as binary classification, classifying a segment as either containing a gesture or not, thus failing to capture its inherently sequential and contextual nature. To address this, we introduce a novel framework that reframes the task as a multi-phase sequence labeling problem rather than binary classification. Our model processes sequences of skeletal movements over time windows, uses Transformer encoders to learn contextual embeddings, and leverages Conditional Random Fields to perform sequence labeling. We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues. The results consistently demonstrate that our method significantly outperforms strong baseline models in detecting gesture strokes. Furthermore, applying Transformer encoders to learn contextual embeddings from movement sequences substantially improves gesture unit detection. These results highlight our framework's capacity to capture the fine-grained dynamics of co-speech gesture phases, paving the way for more nuanced and accurate gesture detection and analysis.
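To make the sequence-labeling idea concrete, here is a minimal sketch of Viterbi decoding for a linear-chain CRF over the four phase labels. It is not the authors' implementation. The emission scores stand in for the per-window outputs of the encoder, the transition scores are hand-picked rather than learned, and the label order and the reading of N as "no gesture" are assumptions made for illustration.

```python
import numpy as np

# Label set from the paper; the index order is an assumption of this sketch.
PHASES = ["N", "P", "S", "R"]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions:   (T, K) array of per-window label scores.
    transitions: (K, K) array, transitions[i, j] = score of moving from i to j.
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in label i at t-1, then moving to j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Trace the best path backwards from the best final label.
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]


# Hypothetical transition scores: penalise jumping from N straight into a
# stroke or retraction, and from preparation straight into retraction.
transitions = np.zeros((4, 4))
transitions[0, 2] = transitions[0, 3] = transitions[1, 3] = -10.0

# Hypothetical emission scores for five windows; the second window slightly
# prefers S over P when looked at in isolation.
emissions = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.5, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 0.0],
])

independent = [PHASES[i] for i in emissions.argmax(axis=1)]
structured = [PHASES[i] for i in viterbi_decode(emissions, transitions)]
print("per-window argmax:", independent)
print("Viterbi decoding: ", structured)
```

With these toy scores, labeling each window independently jumps from N directly to S, while the structured decoder inserts a preparation phase because the transition scores make that path cheaper overall. This is the kind of phase-order constraint that a classifier applied to isolated windows cannot express.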

Paper Structure

This paper contains 29 sections, 2 equations, 7 figures, and 4 tables.

Introduction
Related Work
Data Modeling in Gesture Analysis
Gesture Detection
Data and Preprocessing
Dataset
Constructing Multi-Phase Sequential Data
Representing Time Windows
Sequence Labeling for Gesture Detection
Problem Definition
Model Architecture
Embedding Time Windows via ST-GCNs
Transformer-based Sequence Encoding
Position-wise Prediction Layers
Structured Prediction via CRFs
...and 14 more sections

Figures (7)

Figure 1: A gesture unit consists of sequential gestural phases. Figure adapted from Sanchez et al. (2022).
Figure 2: Data collection setup: Two participants play a referential game, freely communicating using speech and gestures.
Figure 3: A spatio-temporal graph is extracted from the estimated upper body pose (Sengupta et al., 2020), adapted from Jiang et al. (2021).
Figure 4: The architecture of the multi-phase sequence labeler, consisting of the model components described in the Model Architecture section.
Figure 5: Illustration of linear chain CRF for a sequence of 7 states, i.e., gesture phases. Each observed input $\bm{x}^{(i)}$ represents a segment of the video and each state $y^{(i)}$ corresponds to the phase of that segment. The arrows represent the dependencies between the observations and the states and between the states in a sequence.
...and 2 more figures
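
For reference, the dependencies described in the Figure 5 caption correspond to the standard linear-chain CRF factorization. The form below is the textbook one, written with the caption's symbols; the score functions $\phi$ (observation to state) and $\psi$ (state to state) are generic placeholders and may not match the paper's exact parameterization.

```latex
p(\bm{y} \mid \bm{x}) = \frac{1}{Z(\bm{x})}
  \exp\!\Big( \sum_{i=1}^{n} \phi\big(y^{(i)}, \bm{x}^{(i)}\big)
            + \sum_{i=2}^{n} \psi\big(y^{(i-1)}, y^{(i)}\big) \Big),
\qquad
Z(\bm{x}) = \sum_{\bm{y}'} \exp\!\Big( \sum_{i=1}^{n} \phi\big(y'^{(i)}, \bm{x}^{(i)}\big)
            + \sum_{i=2}^{n} \psi\big(y'^{(i-1)}, y'^{(i)}\big) \Big)
```

Here $n$ is the sequence length (7 in the figure), and the sum in $Z(\bm{x})$ runs over all possible phase sequences $\bm{y}'$.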
