EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

Xinyuan Qian, Xinjia Zhu, Alessio Brutti, Dong Liang

Abstract

The Talking to Me (TTM) task is a pivotal component in understanding human social interactions: it aims to determine who is engaged in conversation with the camera-wearer. Traditional models often struggle in real-world scenarios because of missing visual data, neglect of head orientation cues, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both kinds of signal and distinguishing TTM from tasks that only detect speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.

Paper Structure (33 sections, 18 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 2: Overview of the EgoAdapt framework for egocentric interactive speaker detection. The VMMA module assesses modality availability and outputs $p_{\text{vmm}}$ to guide adaptation. The VSTR module extracts non-verbal and verbal cues ($z_h$ and $z_l$) from head crops $D_h$ and lip crops $D_l$, respectively, while the PSA encoder processes clean and noisy audio ($D_a$ and $D_m$) to produce $z_a$. These cues are fused in the Visual-Audio Fusion module for the final TTM prediction of "Talking to Me" or "Non Talking to Me".
  • Figure 3: VSTR module. It includes two components: the Head Pose Feature Extraction Module, which captures head orientation (non-verbal cues), and the Lip Feature Extraction Module, which extracts lip movement features (verbal cues).
  • Figure 4: PSA encoder. It processes clean speech $D_a$ and noise-mixed speech $D_m$ (ratio $\gamma$) into Mel spectrograms, encodes them with shared weights to obtain embeddings $z_a$ and $z_m$, and uses MSE loss to enforce noise-robust audio features.
  • Figure 5: VMMA module. It dynamically handles missing visual inputs by processing head crops $D_h$ and the initialized prompt $p_{\text{vmm}}$. It extracts fine-grained features by tracking frame presence and coarse-grained features by quantifying the missing frame ratio $\Delta_c$.
  • Figure 6: Overview of the Visual-Audio Fusion module. It fuses head pose $z_h$, lip motion $z_l$, and audio $z_a$ via Head-Lip, Lip-Audio, and Audio-Head cross-attention blocks. The aggregated features, together with the VMMA prompt $p_{\mathrm{vmm}}$, are refined by self-attention for the final "Talking to Me" prediction.
  • ...and 11 more figures
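
The captions for Figures 4 and 5 describe two simple quantities that can be illustrated in a few lines: the per-frame presence signal and missing frame ratio $\Delta_c$ used by VMMA, and the MSE consistency term between clean and noise-mixed embeddings ($z_a$, $z_m$) in the PSA encoder. The following is a minimal sketch, not the authors' implementation. All function names are hypothetical, the additive mixing rule `clean + gamma * noise` is an assumption (the captions only say the noise is mixed at ratio $\gamma$), and `toy_encoder` is a stand-in for the real audio encoder.

```python
from typing import List, Optional, Sequence, Tuple


def missing_frame_signals(
    frames: Sequence[Optional[object]],
) -> Tuple[List[int], float]:
    """Return (presence mask, missing-frame ratio) for a clip.

    A frame slot holding None is treated as a missing head crop.
    The mask is the fine-grained signal; the ratio plays the role
    of Delta_c, the coarse-grained signal.
    """
    if not frames:
        raise ValueError("need at least one frame slot")
    presence = [0 if f is None else 1 for f in frames]
    missing_ratio = 1.0 - sum(presence) / len(presence)
    return presence, missing_ratio


def mix_with_noise(
    clean: Sequence[float], noise: Sequence[float], gamma: float
) -> List[float]:
    """Assumed mixing rule: D_m = D_a + gamma * noise (illustrative only)."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    return [c + gamma * n for c, n in zip(clean, noise)]


def toy_encoder(signal: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Stand-in encoder: projects mean signal energy with fixed weights."""
    energy = sum(s * s for s in signal) / len(signal)
    return [w * energy for w in weights]


def consistency_loss(
    clean: Sequence[float],
    noise: Sequence[float],
    gamma: float,
    weights: Sequence[float],
) -> float:
    """MSE between embeddings of clean and noise-mixed audio.

    Passing the same `weights` to both encoder calls is what
    "shared weights" means in this sketch.
    """
    z_a = toy_encoder(clean, weights)
    z_m = toy_encoder(mix_with_noise(clean, noise, gamma), weights)
    return sum((a - m) ** 2 for a, m in zip(z_a, z_m)) / len(z_a)


if __name__ == "__main__":
    # Two of four head crops are missing.
    print(missing_frame_signals(["f0", None, "f2", None]))  # → ([1, 0, 1, 0], 0.5)
    # With gamma = 0 the mixed signal equals the clean one, so the loss vanishes.
    print(consistency_loss([1.0, 2.0], [0.5, -0.5], 0.0, [0.1, 0.2]))  # → 0.0
```

Minimizing a term like `consistency_loss` pushes the encoder to map clean and noisy versions of the same speech to nearby embeddings, which is the noise-robustness effect the Figure 4 caption attributes to the MSE loss.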