AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification

Wang Tang, Fethiye Irmak Dogan, Linbo Qing, Hatice Gunes

TL;DR

AsyReC addresses asymmetric dyadic relationship classification from multimodal data by integrating a triplet graph neural network with node-edge dual attention, a clip-level learning strategy, and a periodic temporal encoder. The method preserves temporal continuity across uniformly segmented clips and explicitly models recurrent behavioral patterns through sinusoidal temporal embeddings. Empirical results on the NoXi and UDIVA datasets show state-of-the-art performance and robust handling of class imbalance, with ablation studies confirming the contribution of asymmetric interaction modeling and periodic encoding. The work advances socially intelligent systems by enabling more nuanced perception of bidirectional relationships and offers publicly available code for reproducibility.

Abstract

Dyadic social relationships, which refer to relationships between two individuals who may or may not know each other through repeated interactions, are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges: (1) the failure to model asymmetric relationships, e.g., one individual may perceive the other as a friend while the other perceives them as an acquaintance; (2) the disruption of continuous interactions by discrete frame sampling, which breaks the temporal continuity of interaction in real-world scenarios; and (3) the failure to consider periodic behavioral cues, such as rhythmic vocalizations or recurrent gestures, which are crucial for inferring the evolution of dyadic relationships. To address these challenges, we propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification, with three core innovations: (i) a triplet graph neural network with node-edge dual attention that dynamically weights multimodal cues to capture interaction asymmetries (addressing challenge 1); (ii) a clip-level relationship learning architecture that preserves temporal continuity, enabling fine-grained modeling of real-world interaction dynamics (addressing challenge 2); and (iii) a periodic temporal encoder that projects time indices onto sine/cosine waveforms to model recurrent behavioral patterns (addressing challenge 3). Extensive experiments on two public datasets demonstrate state-of-the-art performance, while ablation studies validate the critical role of asymmetric interaction modeling and periodic temporal encoding in improving the robustness of dyadic relationship classification in real-world scenarios. Our code is publicly available at: https://github.com/tw-repository/AsyReC.
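
To make innovation (iii) concrete, the following is a minimal sketch of projecting a clip's time index onto sine/cosine waveforms. It is not taken from the AsyReC repository: the function name `periodic_time_encoding`, the embedding size `dim`, and the geometric frequency schedule with `base=10000` (borrowed from Transformer positional encodings) are all assumptions made for illustration, and the paper's encoder may choose or learn its frequencies differently.

```python
import math


def periodic_time_encoding(t, dim=8, base=10000.0):
    """Map a clip index t to `dim` sine/cosine features.

    Hypothetical helper: the frequency schedule is an assumption,
    not the schedule used by AsyReC.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even (one sin and one cos per frequency)")
    features = []
    for k in range(dim // 2):
        # Frequencies decay geometrically from 1 down towards 1/base.
        freq = 1.0 / (base ** (2 * k / dim))
        features.append(math.sin(t * freq))
        features.append(math.cos(t * freq))
    return features


# Encode the time indices of four consecutive clips.
embeddings = [periodic_time_encoding(t) for t in range(4)]
```

Because every feature is a sinusoid of the time index, clips that are a whole number of periods apart receive similar values on the matching frequency, which is one plausible way for a downstream classifier to pick up recurrent behavioral patterns.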

Paper Structure

This paper contains 31 sections, 13 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Existing research paradigms: (a) Image-based SRR with an image from the PISC [11] dataset. (b) Video-based SRR with a video from the ViSR [21] dataset. (c) Asymmetric relationship, where the relationship from person A's perspective is different from that of person B. (d) Clip-level relationship learning. The screenshots in (c) and (d) are from the NoXi database collected for the ARIA-VALUSPA project [49]. The abbreviations are Very good friend (Vgf), Friend (Fri), Stranger (Str), Acquaintance (Acq).
  • Figure 2: The overall framework of AsyReC. First, (a) a pair of videos is segmented into $n$ clips. (b) Each pair of clips is then processed to extract face, body, audio, and text features using dedicated encoders. (c) These features are structured into graph networks to model asymmetric interactions. (d) Simultaneously, the temporal signals are upsampled into high-dimensional embeddings, followed by sin/cos wave mapping. Finally, (e) the temporal embeddings, multimodal feature embeddings, and graph-inferred knowledge are concatenated for relationship classification. The screenshots are from the NoXi database collected for the ARIA-VALUSPA project [49].
  • Figure 3: Node-Edge Attention Graph Network (NE-AGN). The model sequentially computes (a) node attention, (b) edge attention, and (c) updated node representations.
  • Figure 4: Temporal Signal Modeling Framework. Given an input video pair, it is segmented into $T$ clip pairs, each processed through multimodal feature encoding, graph-based interaction inference, and temporal signal encoding. Encoded temporal signals, graph-inferred knowledge, and multimodal features are fused for relationship classification. Classification layer weights are shared across clips, enabling automatic learning of periodic dependencies via temporal embedding. The screenshots are from the NoXi database collected for the ARIA-VALUSPA project [49].
  • Figure 5: Confusion matrices for recognition results on NoXi: (a) PGCN, (b) LIReC, (c) CT and (d) AsyReC. Note: For better visualization, we compute the averages of NoXi-I and NoXi-J.
  • ...and 5 more figures
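
The segmentation step described in the captions of Figures 2 and 4 (a video pair is split into clip pairs) can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the helper names `clip_ranges` and `segment_pair`, the integer-division boundaries, and the truncation to the shorter video are all hypothetical choices. The point it shows is that both videos are cut at the same boundaries, so every clip pair stays time-aligned and the clips are contiguous with no frames skipped.

```python
def clip_ranges(num_frames, n_clips):
    """Return n_clips contiguous (start, end) ranges covering [0, num_frames)."""
    if n_clips <= 0 or num_frames < n_clips:
        raise ValueError("need at least one frame per clip")
    # Integer boundaries keep the clips contiguous and near-uniform in length.
    bounds = [i * num_frames // n_clips for i in range(n_clips + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_clips)]


def segment_pair(frames_a, frames_b, n_clips):
    """Cut two synchronized frame sequences at identical boundaries."""
    # Assumption: if lengths differ, only the overlapping part is used.
    num_frames = min(len(frames_a), len(frames_b))
    return [(frames_a[s:e], frames_b[s:e])
            for s, e in clip_ranges(num_frames, n_clips)]


# Toy example: integers stand in for video frames of persons A and B.
clips = segment_pair(list(range(10)), list(range(100, 110)), 4)
```

In a full pipeline, each element of `clips` would be passed to the multimodal encoders, and its position in the list would serve as the time index for the temporal encoder.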