Exploring Missing Modality in Multimodal Egocentric Datasets

Merey Ramazanova, Alejandro Pardo, Humam Alwassel, Bernard Ghanem

TL;DR

This work tackles missing modalities in multimodal egocentric video understanding by introducing the Missing Modality Token (MMT), a learnable representation for absent inputs integrated into a Multimodal Bottleneck Transformer backbone. By training with modal-incomplete data and a random-replace strategy, MMT substantially mitigates test-time performance drops caused by missing modalities across Ego4D, Epic-Kitchens, and Epic-Sounds, reducing the drop from about 30 to roughly 10 percentage points when half of the test set is modal-incomplete. The authors provide ablations on fusion layer placement and training data composition, along with comparisons to baselines and prompt-based methods, demonstrating robust performance under various incomplete-signal scenarios, including when either of the two modalities may be missing. Overall, the approach enables more resilient audiovisual egocentric models for real-world settings constrained by privacy, efficiency, or hardware failures, with guidance on when and how to deploy MMT across datasets.

Abstract

Multimodal video understanding is crucial for analyzing egocentric videos, where integrating multiple sensory signals significantly enhances action recognition and moment localization. However, practical applications often grapple with incomplete modalities due to factors like privacy concerns, efficiency demands, or hardware malfunctions. Addressing this, our study delves into the impact of missing modalities on egocentric action recognition, particularly within transformer-based models. We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent, a strategy that proves effective on the Ego4D, Epic-Kitchens, and Epic-Sounds datasets. Our method mitigates the performance loss, reducing the drop from the original $\sim 30\%$ to only $\sim 10\%$ when half of the test set is modal-incomplete. Through extensive experimentation, we demonstrate the adaptability of MMT to different training scenarios and its superiority in handling missing modalities compared to current methods. Our research contributes a comprehensive analysis and an innovative approach, opening avenues for more resilient multimodal systems in real-world settings.

Paper Structure (17 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Most commonly, we train the multimodal models on modal-complete data. These models (orange) fail when encountering modal-incomplete data at test time. Our proposed adaptation to the missing modality (green) significantly improves the performance across datasets. When all test inputs are modal-incomplete ($r_{test} = 100\%$), we surpass unimodal performance (purple) by 5 points in Epic-Kitchens, and double the baseline performance in Ego4D-AR.
  • Figure 2: Learning and Predicting with Missing Modalities. Left: Given modal-incomplete data, it is still unclear how to effectively train and predict with a multimodal model (we present some naive baseline methods in Sec. \ref{subsec:baseline}). Right: To address this issue, we introduce a Missing Modality Token (MMT). During training, MMT learns the representation of missing inputs from modal-incomplete samples and modal-complete samples. For the latter, we use random-replace to let the network observe the missing inputs and thus learn better representations (Sec. \ref{subsec:ours}). At test time, we replace the tokens of missing inputs with MMT to effectively represent them.
  • Figure 3: Modality drop probability $p$ vs. accuracy for modal-complete Epic-Sounds, Epic-Kitchens, and Ego4D-AR. In all datasets, our method dramatically improves the performance of the baseline (orange).
  • Figure 4: Results with the modal-incomplete training data. As Epic-Sounds does not naturally have missing modality in the training data, we manually remove the audio from (left) $r_{train} = 25\%$ and (right) $r_{train} = 50\%$ of samples in the train set.
  • Figure 5: Results on Epic-Sounds with $r_{train}^A = 25\%, r_{train}^V = 25\%$. We train our model with two MMTs: one for missing video and one for missing audio. We run the inference twice: (left) with missing video and (right) with missing audio.
  • ...and 1 more figure
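
The token replacement described in the Figure 2 caption, and the manual modality removal described in the Figure 4 caption, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`mmt_sequence`, `prepare_modality_tokens`, `remove_modality`), the list-of-floats token representation, and the fixed `num_tokens` are assumptions made for illustration. In the paper the MMT is a learnable representation inside a transformer, whereas here it is a plain constant vector.

```python
import random
from typing import List, Optional, Sequence

Token = List[float]


def mmt_sequence(mmt: Token, num_tokens: int) -> List[Token]:
    """Build a token sequence made only of copies of the missing-modality token."""
    return [list(mmt) for _ in range(num_tokens)]


def prepare_modality_tokens(
    tokens: Optional[List[Token]],
    mmt: Token,
    num_tokens: int,
    training: bool,
    p: float,
    rng: random.Random,
) -> List[Token]:
    """Return the tokens fed to the model for one modality of one sample.

    `tokens` is None when the modality is missing (hypothetical convention).
    `p` plays the role of the modality drop probability.
    """
    # Modal-incomplete sample: represent the absent modality with the MMT.
    if tokens is None:
        return mmt_sequence(mmt, num_tokens)
    # Random-replace (training only): with probability p, hide an available
    # modality behind the MMT so the model also observes "missing" inputs.
    if training and rng.random() < p:
        return mmt_sequence(mmt, num_tokens)
    # Otherwise pass the real tokens through unchanged.
    return tokens


def remove_modality(
    samples: Sequence[object], ratio: float, rng: random.Random
) -> List[Optional[object]]:
    """Mark round(ratio * len(samples)) randomly chosen samples as missing (None)."""
    n_drop = round(ratio * len(samples))
    drop_idx = set(rng.sample(range(len(samples)), n_drop))
    return [None if i in drop_idx else s for i, s in enumerate(samples)]
```

At test time the sketch is called with `training=False`, so real tokens are never replaced and only samples whose modality is actually missing receive the MMT. Passing `ratio=0.25` to `remove_modality` mimics a setting such as $r_{train} = 25\%$.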