Test-Time Adaptation for Combating Missing Modalities in Egocentric Videos

Merey Ramazanova, Alejandro Pardo, Bernard Ghanem, Motasem Alfarra

TL;DR

This paper tackles the practical problem of missing modalities in multimodal egocentric video without retraining. It reframes the issue as test-time adaptation and introduces MiDl, which minimizes mutual information between the model output and the test-time modality while employing self-distillation to preserve performance when all modalities are present. The approach is architecture-, dataset-, and modality-agnostic, and it demonstrates consistent gains across Epic-Kitchens, Epic-Sounds, and Ego4D settings, including long-term adaptation and out-of-domain warm-up experiments. The findings highlight MiDl as a scalable, online solution for robust multimodal predictions under incomplete data, with a clear trade-off in computation that remains manageable through parallelization. Overall, MiDl advances practical robustness for multimodal vision by enabling effective test-time adaptation specifically for missing modalities in real-world scenarios.

Abstract

Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
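
The mutual information that the abstract refers to can be written with the standard entropy decomposition. The block below is a sketch in our own notation, not necessarily the paper's: $M$ is assumed to be a random variable over the available modality combinations, $p_\theta(y \mid x, m)$ is the model's predicted class distribution when given combination $m$, and $H$ is Shannon entropy.

```latex
\begin{aligned}
I(Y; M) &= H(Y) - H(Y \mid M) \\
        &= H\!\left(\mathbb{E}_{m}\big[\,p_\theta(y \mid x, m)\,\big]\right)
           - \mathbb{E}_{m}\Big[\,H\big(p_\theta(y \mid x, m)\big)\Big]
\end{aligned}
```

This quantity is zero exactly when the prediction is the same for every modality combination, which is the insensitivity to the modality source described above.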

Paper Structure

This paper contains 35 sections, 5 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Test-Time Adaptation for missing modalities. In test-time adaptation with missing modalities, the model receives a stream of multimodal data in which one or more modalities may be absent. Without adaptation, the pretrained model $f_{\theta_0}$ may predict inaccurate labels due to incomplete data. With test-time adaptation, the model is dynamically adjusted using the adaptation method $g$, resulting in an adapted model $f_{\theta_t}$ that is designed to handle the missing modalities and improve over time. The graph on the right compares the non-adapted baseline (blue) with the model adapted using our proposed method MiDl (green) on the Epic-Kitchens dataset. The adapted model maintains higher performance despite the variability in modality completeness, surpassing the unimodal performance (orange) for all missing rates.
  • Figure 2: Adapting at test-time with MiDl. At test time, the stream reveals a sample. MiDl uses multimodal samples to adapt and requires one forward pass for each modality combination. MiDl leverages the Kullback-Leibler (KL) divergence to align the predictions of the adapted model $f_{\theta_t}$ with those of the original model $f_{\theta_0}$, ensuring that the adapted model does not deviate too far from the original model's predictions. The Mutual-Information (MI) component uses the predictions from the different modalities to reduce the dependency on any specific modality, fostering a more generalized and robust prediction across different modality combinations. MiDl updates the model for step $t+1$ using the combination of KL and MI in Equation \ref{eq:adaptation_MiDl}.
  • Figure 3: Qualitative analysis of MiDl's adaptation performance on Epic-Kitchens. The top two subfigures highlight positive cases where MiDl successfully adapts to predict the correct label (marked in green). Conversely, the bottom two subfigures illustrate negative cases (marked in red) where adaptation introduces errors.
  • Figure 4: Qualitative analysis of MiDl's adaptation performance on Epic-Sounds. The top two subfigures highlight positive cases where MiDl successfully adapts to predict the correct label (marked in green). Conversely, the bottom two subfigures illustrate negative cases (marked in red) where adaptation introduces errors.
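
The adaptation step described in the Figure 2 caption can be sketched in NumPy as a loss computed on a single multimodal sample. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform weighting over modality combinations, the KL direction (original model to adapted model), and the `kl_weight` parameter are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) over the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def mutual_information(probs):
    """I(Y; M) = H(mean_m p_m) - mean_m H(p_m).

    probs has shape (num_modality_combinations, num_classes); each row is
    the prediction from one forward pass. Assumes a uniform prior over
    the modality combinations.
    """
    marginal = probs.mean(axis=0)
    return entropy(marginal) - entropy(probs).mean()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def midl_style_loss(adapted_logits, original_full_logits,
                    full_index=0, kl_weight=1.0):
    """Combine the MI and self-distillation (KL) terms for one sample.

    adapted_logits: (num_modality_combinations, num_classes) outputs of the
        adapted model, one row per modality combination.
    original_full_logits: (num_classes,) output of the frozen original
        model on the complete (all-modalities) input.
    full_index: row of adapted_logits holding the complete-input prediction
        (hypothetical convention for this sketch).
    """
    probs = softmax(adapted_logits)
    mi = mutual_information(probs)
    kl = kl_divergence(softmax(original_full_logits), probs[full_index])
    return mi + kl_weight * kl, mi, kl

# Rows: complete input, first modality only, second modality only.
adapted = np.array([[2.0, 0.5, -1.0],
                    [1.5, 1.0, -0.5],
                    [0.2, 2.2, -1.0]])
original_full = np.array([2.1, 0.4, -1.0])
loss, mi, kl = midl_style_loss(adapted, original_full)
print(f"loss={loss:.4f}  MI={mi:.4f}  KL={kl:.4f}")
```

In an actual test-time adaptation loop, the gradient of this loss with respect to the adapted model's parameters would drive the update for step $t+1$. An automatic-differentiation framework would be needed for that; the sketch only evaluates the objective.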