Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities
Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri
TL;DR
The paper tackles modality loss in person recognition by proposing a Trimodal framework that fuses face, gesture, and voice through modality-specific encoders, cross-attention, and confidence-weighted fusion, coupled with a Mistake Correction module for refinement. It introduces confidence estimation and data augmentation to maintain performance under missing modalities, and validates on the new CANDOR dataset and VoxCeleb1, achieving state-of-the-art results (Trimodal: 99.18% on CANDOR; Bimodal VoxCeleb1: 99.92% top-1). The work highlights the complementary value of face and voice and the challenges of gesture under session variability, and it demonstrates robust performance across unimodal, bimodal, and trimodal configurations with comprehensive ablations. The practical impact is robust multimodal person recognition for real-world deployments where modalities can be degraded or unavailable; the code and data are publicly available.
Abstract
Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms that facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Finally, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
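To make the idea of confidence-weighted fusion under missing modalities concrete, the following is a minimal sketch and not the authors' implementation. It assumes each modality produces per-identity logits and a scalar confidence, that a missing modality is represented by `None`, and that fusion is a confidence-normalized average of per-modality softmax probabilities. The function names `softmax` and `fuse`, the dictionary layout, and the example numbers are all hypothetical.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(
    logits_by_modality: Dict[str, Optional[List[float]]],
    confidence: Dict[str, float],
) -> List[float]:
    """Confidence-weighted average of per-modality class probabilities.

    Assumption (not from the paper): a missing modality is passed as None
    and is dropped; the remaining confidences are renormalized to sum to 1.
    """
    available = {
        name: softmax(logits)
        for name, logits in logits_by_modality.items()
        if logits is not None
    }
    if not available:
        raise ValueError("at least one modality must be present")

    # Negative or absent confidences are treated as zero.
    weights = {name: max(confidence.get(name, 0.0), 0.0) for name in available}
    total = sum(weights.values())
    if total == 0.0:
        # Fall back to a uniform average when no confidence is informative.
        weights = {name: 1.0 for name in available}
        total = float(len(available))

    num_classes = len(next(iter(available.values())))
    fused = [0.0] * num_classes
    for name, probs in available.items():
        w = weights[name] / total
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


# Hypothetical Bimodal case: gesture is unavailable, so only face and voice vote.
scores = {"face": [2.0, 0.5, 0.1], "voice": [0.2, 1.5, 0.3], "gesture": None}
conf = {"face": 0.9, "voice": 0.4, "gesture": 0.7}
fused = fuse(scores, conf)
predicted_identity = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights are renormalized over whichever modalities are present, the same function covers the Unimodal, Bimodal, and Trimodal cases; with a single available modality the output reduces to that modality's own probabilities.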
