EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

Xinyuan Qian; Xinjia Zhu; Alessio Brutti; Dong Liang

Abstract

The Talking to Me (TTM) task is a pivotal component in understanding human social interactions; it aims to determine who is engaged in conversation with the camera-wearer. Traditional models often struggle in real-world scenarios because of missing visual data, background noise, and their neglect of head orientation. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric "Talking to Me" speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, which sets it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
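As a rough illustration of the per-frame presence estimate in item (3), the sketch below derives two signals from a clip: a frame-level presence mask and the fraction of missing frames (the quantity the Figure 5 caption denotes $\Delta_c$). It is a minimal sketch under stated assumptions, not the paper's VMMA module: the function name `visual_missing_signals` is hypothetical, and a missing head crop is assumed to be represented by `None`, whereas the actual module works on head crops and a learned prompt.

```python
from typing import List, Optional, Sequence, Tuple


def visual_missing_signals(
    frames: Sequence[Optional[object]],
) -> Tuple[List[int], float]:
    """Return (per-frame presence mask, missing-frame ratio).

    Assumption: a missing head crop is represented by None.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    # Fine-grained signal: 1 where a head crop exists, 0 where it is missing.
    presence = [0 if frame is None else 1 for frame in frames]
    # Coarse-grained signal: share of frames in the clip that are missing.
    missing_ratio = presence.count(0) / len(presence)
    return presence, missing_ratio


# Hypothetical clip of four frames, two of which have no head crop.
presence, delta_c = visual_missing_signals(["crop0", None, "crop2", None])
print(presence, delta_c)
```

A downstream model could consume the mask frame by frame and the ratio once per clip; how EgoAdapt actually injects these signals is described by the paper, not by this sketch.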

Paper Structure (33 sections, 18 equations, 16 figures, 6 tables)

INTRODUCTION
RELATED WORK
Audio-Visual Active Speaker Detection
Audio-Visual Social Interaction
METHODOLOGY
Problem Definition and Method Overview
Visual Speaker Target Recognition Module
Head Pose Feature Extraction Module
Lip Feature Extraction Module
Parallel Shared-weight Audio Encoder
Visual Modality Missing Awareness Module
Fine-grained
Coarse-grained
Visual-Audio Fusion Module
EXPERIMENTATION, RESULT AND ANALYSIS
...and 18 more sections

Figures (16)

Figure 2: Overview of the EgoAdapt framework for egocentric interactive speaker detection. The VMMA module assesses modality availability and outputs $p_{vmm}$ to guide adaptation. The VSTR module extracts non-verbal and verbal cues ($z_h$ and $z_l$) from head crops $D_h$ and lip crops $D_l$, respectively, while the PSA encoder processes clean and noisy audio ($D_a$ and $D_m$) to produce $z_a$. These cues are fused in the Visual-Audio Fusion module to produce the final TTM prediction of "Talking to Me" or "Non Talking to Me".
Figure 3: VSTR module. It includes two components: the Head Pose Feature Extraction Module, which captures head orientation (non-verbal cues), and the Lip Feature Extraction Module, which extracts lip movement features (verbal cues).
Figure 4: PSA encoder. It processes clean speech $D_a$ and noise-mixed speech $D_m$ (ratio $\gamma$) into Mel spectrograms, encodes them with shared weights to obtain embeddings $z_a$ and $z_m$, and uses MSE loss to enforce noise-robust audio features.
Figure 5: VMMA module. It dynamically handles missing visual inputs by processing head crops $D_h$ and the initialized prompt $p_{\text{vmm}}$. It extracts fine-grained features by tracking frame presence and coarse-grained features by quantifying the missing frame ratio $\Delta_c$.
Figure 6: Overview of the Visual-Audio Fusion module. It fuses head pose $z_h$, lip motion $z_l$, and audio $z_a$ via Head-Lip, Lip-Audio, and Audio-Head cross-attention blocks. The aggregated features, together with the VMMA prompt $p_{\mathrm{vmm}}$, are refined by self-attention for the final "Talking to Me" prediction.
...and 11 more figures
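
The shared-weight consistency idea in the Figure 4 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's PSA encoder: the Mel-spectrogram step is omitted, the encoder is a stand-in linear map, and additive mixing $D_m = D_a + \gamma \cdot \text{noise}$ is an assumed form, since the caption gives only the ratio $\gamma$. The helper names `mix_with_noise`, `encode`, and `mse` are hypothetical.

```python
import random
from typing import List


def mix_with_noise(clean: List[float], noise: List[float], gamma: float) -> List[float]:
    """Assumed mixing rule: clean + gamma * noise, sample by sample."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    return [c + gamma * n for c, n in zip(clean, noise)]


def encode(signal: List[float], weights: List[List[float]]) -> List[float]:
    """Toy linear encoder: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, signal)) for row in weights]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # stands in for D_a
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(4)]

d_m = mix_with_noise(clean, noise, gamma=0.3)  # stands in for D_m
z_a = encode(clean, weights)                   # clean-branch embedding
z_m = encode(d_m, weights)                     # noisy branch reuses the same weights
consistency_loss = mse(z_a, z_m)
print(consistency_loss)
```

Because both branches use one set of weights, minimizing this loss pushes the encoder toward embeddings that change little when noise is added, which is the sense in which the caption calls the features noise-robust.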
