Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities

Aref Farhadipour, Teodora Vukovic, Volker Dellwo, Petr Motlicek, Srikanth Madikeri

TL;DR

The paper tackles modality loss in person recognition by proposing a Trimodal framework that fuses face, gesture, and voice through modality-specific encoders, cross-attention, and confidence-weighted fusion, coupled with a Mistake Correction module for refinement. It introduces confidence estimation and data augmentation to maintain performance under missing modalities, and validates on the new CANDOR dataset and VoxCeleb1, achieving state-of-the-art results (Trimodal on CANDOR: 99.18% Top-1; Bimodal on VoxCeleb1: 99.92% Top-1). The work highlights the complementary value of face and voice and the challenges of gesture under session variability, and it demonstrates robust performance across Unimodal, Bimodal, and Trimodal configurations with comprehensive ablations. The practical impact is robust multimodal person recognition for real-world deployments where modalities can be degraded or unavailable; the code and data are publicly available.

Abstract

Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a Trimodal person identification framework that integrates voice, face, and gesture modalities while remaining robust to modality loss. Our approach leverages multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms to facilitate interaction across modalities. Moreover, a confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, ensuring optimal classification even in Unimodal or Bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. Our results demonstrate that the proposed Trimodal system achieves 99.18% Top-1 accuracy on person identification tasks, outperforming conventional Unimodal and late-fusion approaches. In addition, we evaluate our model on the VoxCeleb1 dataset as a benchmark and reach 99.92% accuracy in Bimodal mode. Finally, we show that our system maintains high accuracy even when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
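
To make the idea of confidence-weighted fusion under missing modalities more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each modality produces a list of class logits (or `None` when the modality is missing) together with a scalar confidence, and it fuses the per-modality softmax probabilities with weights renormalized over the available modalities. The function name `confidence_weighted_fusion`, the modality keys, and the renormalization rule are all assumptions made for illustration; the paper's actual fusion may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_fusion(modality_logits, confidences):
    """Fuse class scores from the modalities that are present.

    modality_logits: dict mapping modality name -> list of logits,
                     or None when that modality is missing (assumed convention).
    confidences:     dict mapping modality name -> non-negative confidence.
    Returns a list of fused class probabilities that sums to 1.
    """
    available = {
        name: logits
        for name, logits in modality_logits.items()
        if logits is not None
    }
    if not available:
        raise ValueError("at least one modality must be available")

    # Keep only confidences of available modalities; clamp negatives to zero.
    weights = {name: max(confidences.get(name, 0.0), 0.0) for name in available}
    total = sum(weights.values())
    if total == 0.0:
        # Fall back to uniform weights if no confidence information is usable.
        weights = {name: 1.0 for name in available}
        total = float(len(available))

    num_classes = len(next(iter(available.values())))
    fused = [0.0] * num_classes
    for name, logits in available.items():
        probs = softmax(logits)
        weight = weights[name] / total  # renormalize over available modalities
        for i, p in enumerate(probs):
            fused[i] += weight * p
    return fused


# Bimodal case: gesture is missing, so only face and voice contribute.
scores = confidence_weighted_fusion(
    {"face": [2.0, 0.1, 0.3], "voice": [1.5, 0.2, 0.1], "gesture": None},
    {"face": 0.9, "voice": 0.6, "gesture": 0.4},
)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Because the weights are renormalized over whichever modalities are present, the same function covers Unimodal, Bimodal, and Trimodal inputs without a separate code path for each case.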

Paper Structure

This paper contains 16 sections, 11 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Proposed Trimodal system. Each modality is fed into its corresponding encoder and processing pathway. The outputs of these pathways are then combined in a Trimodal cross-attention block, and all information is finally integrated in the decision-making block.
  • Figure 2: Distribution of Speakers Across Sessions in the CANDOR Dataset