Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics
Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo
TL;DR
This work addresses person identification from natural conversational dynamics using a transformer-based framework that operates on skeletal keypoints and avoids appearance cues. It introduces a two-stream architecture with a spatial transformer for frame-level postures and a multi-scale temporal transformer for hierarchical motion, followed by feature-level fusion. Domain-specific training from scratch outperforms pre-training. The spatial stream achieves 95.74% accuracy, the multi-scale temporal stream achieves 93.90%, and fusion raises accuracy to 98.03% on 114 speakers. The results demonstrate that conversational behavior contains distinctive identity signatures, enabling privacy-preserving biometrics and informing multimodal and cross-cultural studies of real-world interactions.
Abstract
This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversational scenarios. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.
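As a rough illustration of three ingredients named above (velocity features, multi-scale temporal views, and feature-level fusion), the following minimal sketch shows one plausible reading of each. It is not the implementation used in this work. The function names, the first-order-difference definition of velocity, the strides (1, 2, 4), and fusion by concatenation are all assumptions made for illustration; the transformer streams themselves are omitted.

```python
from typing import Dict, List, Sequence

# Assumed layout: one frame is a list of K keypoints, each [x, y].
# In the paper's setting K would be 133 (COCO WholeBody); here K is arbitrary.
Frame = List[List[float]]


def velocity_features(frames: List[Frame]) -> List[Frame]:
    """Assumed definition: frame-to-frame keypoint displacement.

    Returns T-1 frames for an input of T frames.
    """
    return [
        [[c1 - c0 for c0, c1 in zip(kp0, kp1)] for kp0, kp1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def multi_scale_views(
    frames: List[Frame], strides: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[Frame]]:
    """Hypothetical multi-scale input: subsample the sequence at several strides."""
    return {s: frames[::s] for s in strides}


def fuse_features(spatial_emb: Sequence[float], temporal_emb: Sequence[float]) -> List[float]:
    """Assumed feature-level fusion: concatenate the two stream embeddings."""
    return list(spatial_emb) + list(temporal_emb)


if __name__ == "__main__":
    # Toy sequence: 4 frames, 2 keypoints moving at constant speed.
    frames = [[[float(t), 0.0], [0.0, 2.0 * t]] for t in range(4)]
    print(velocity_features(frames)[0])
    print({s: len(v) for s, v in multi_scale_views(frames).items()})
    print(fuse_features([0.1, 0.2], [0.3]))
```

In a full pipeline, the fused vector would be passed to a classifier over the speaker identities; that step, like the embeddings themselves, is outside the scope of this sketch.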
