Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo

TL;DR

This work addresses person identification from natural conversational dynamics using a transformer-based framework that operates on skeletal keypoints, avoiding appearance cues. It introduces a two-stream architecture with a spatial transformer for frame-level postures and a multi-scale temporal transformer for hierarchical motion, followed by feature-level fusion. Domain-specific training outperforms pretraining, with spatial cues achieving 95.74% accuracy, temporal dynamics 93.90% with multi-scale modeling, and fusion pushing accuracy to 98.03% on 114 speakers. The results demonstrate that conversational behavior contains distinctive identity signatures, enabling privacy-preserving biometrics and informing multimodal and cross-cultural studies in real-world interactions.

Abstract

This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversational scenarios. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.
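To make the velocity features and feature-level fusion mentioned in the abstract concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes keypoints are stored as a `(T, 133, 2)` array of (x, y) coordinates, that velocity is a simple frame-to-frame difference, and that fusion is a concatenation of the two stream embeddings followed by a linear classifier. The function names, the embedding sizes, and the classifier are hypothetical choices made for illustration.

```python
import numpy as np

NUM_KEYPOINTS = 133  # COCO WholeBody keypoints, as stated in the abstract


def velocity_features(keypoints: np.ndarray) -> np.ndarray:
    """Frame-to-frame displacement of every keypoint (assumed definition).

    keypoints: array of shape (T, K, 2) holding (x, y) per keypoint per frame.
    Returns an array of the same shape; the first frame gets zero velocity.
    """
    velocity = np.zeros_like(keypoints)
    velocity[1:] = keypoints[1:] - keypoints[:-1]
    return velocity


def fuse_features(spatial_emb: np.ndarray, temporal_emb: np.ndarray) -> np.ndarray:
    """Feature-level fusion, assumed here to be plain concatenation."""
    return np.concatenate([spatial_emb, temporal_emb], axis=-1)


def classify(fused: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> int:
    """Hypothetical linear classifier; returns the predicted speaker index."""
    logits = fused @ weights + bias
    return int(np.argmax(logits))


# Usage with random stand-in data: 30 frames, two 64-d stream embeddings,
# and 114 speaker classes (the number of speakers reported in the TL;DR).
rng = np.random.default_rng(0)
clip = rng.normal(size=(30, NUM_KEYPOINTS, 2))
vel = velocity_features(clip)
fused = fuse_features(rng.normal(size=64), rng.normal(size=64))
speaker = classify(fused, rng.normal(size=(128, 114)), np.zeros(114))
```

Concatenation keeps both embeddings intact, so a downstream classifier can weight postural and dynamic evidence independently. Other fusion schemes, such as weighted sums or attention, would fit the same interface.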

Paper Structure

This paper contains 17 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The block diagram of the proposed method illustrates: 1) input video; 2) person detection and localization; 3) pose estimation using the Sapiens model; 4) keypoint sequence extraction; 5) spatial and temporal transformer processing; and 6) feature fusion for transformer-based person identification.
  • Figure 2: Multi-Scale Temporal Transformer (MS-TTR): processes inputs at multiple temporal resolutions $(k=3, k=5)$, concatenates features, and optionally applies residual connections.
  • Figure 3: Spatial Transformer (STR) architecture: processes input keypoints through spatial self-attention blocks to learn flexible dependencies between body joints at each frame, followed by joint and temporal averaging before final classification.
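The two captions above describe the core operations of each stream, which can be sketched in NumPy as follows. This is an illustrative approximation under stated assumptions, not the paper's architecture. The spatial part uses a single attention head with hypothetical projection matrices `wq`, `wk`, `wv`. The temporal part assumes that $k$ denotes a temporal window size and uses a moving average as a stand-in for whatever per-scale operator the authors apply. The optional residual connections from Figure 2 are omitted.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(x, wq, wk, wv):
    """Single-head self-attention across joints, applied independently per frame.

    x: (T, K, D) joint embeddings; wq, wk, wv: (D, D) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (T, K, K)
    return softmax(scores, axis=-1) @ v  # (T, K, D)


def pool_joints_and_time(x):
    """Average over joints, then over frames, giving one (D,) clip descriptor."""
    return x.mean(axis=1).mean(axis=0)


def moving_average(x, k):
    """Smooth a (T, D) sequence with an odd window k, edge-padded to keep T frames."""
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Stack the k shifted views of the sequence and average them per frame.
    windows = np.stack([padded[i:i + x.shape[0]] for i in range(k)], axis=0)
    return windows.mean(axis=0)


def multi_scale_features(x, scales=(3, 5)):
    """Concatenate per-frame features computed at each temporal scale."""
    return np.concatenate([moving_average(x, k) for k in scales], axis=-1)


# Usage with random stand-in data: 20 frames, 133 joints, 16-d embeddings.
rng = np.random.default_rng(0)
joints = rng.normal(size=(20, 133, 16))
w = [rng.normal(size=(16, 16)) for _ in range(3)]
clip_descriptor = pool_joints_and_time(spatial_self_attention(joints, *w))  # (16,)
motion = multi_scale_features(joints.mean(axis=1))  # (20, 32)
```

Attending over joints within each frame lets every keypoint draw on every other one, which is one way to read "flexible dependencies between body joints". Concatenating the $k=3$ and $k=5$ outputs keeps short-range and longer-range motion cues side by side.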