Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities

Maria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Jose J. Guerrero, Simone Schaub-Meyer

TL;DR

KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining.

Abstract

Egocentric action recognition enables robots to facilitate human-robot interactions and monitor task progress. Existing methods often rely solely on RGB videos, although additional modalities, such as audio, can improve accuracy under challenging conditions. However, most multimodal approaches assume that all modalities are available at inference time, leading to significant accuracy drops, or even failure, when inputs are missing. To address this limitation, we introduce KARMMA, a multimodal Knowledge distillation framework for egocentric Action Recognition robust to Missing ModAlities that does not require modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining. Our student uses approximately 50% fewer computational resources than the teacher, resulting in a lightweight and fast model that is well suited for on-robot deployment. Experiments on Epic-Kitchens and Something-Something demonstrate that our student achieves competitive accuracy while significantly reducing performance degradation under missing modality conditions.

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: KARMMA motivation. We introduce a novel multimodal-to-multimodal framework that leverages all available modalities while remaining robust when any are absent, eliminating the need for modality-aligned data. KARMMA produces a lightweight student that operates on any subset of the trained modalities, providing high flexibility and computational efficiency, making it suitable for edge and on-device deployment. Solid lines indicate available modalities, while dashed lines denote missing ones.
  • Figure 2: KARMMA training. Left (first stage): training the teacher. Right (second stage): distilling knowledge from the frozen teacher into the student. Both networks use modality dropout, and our student includes our strategy for handling missing modalities (see \ref{sec:karmma-enhancements}) to remain robust when inputs are incomplete.
  • Figure 3: Missing modality strategy. To handle missing modalities, the embedding layer projects the tokens from the feature extractor when available. Then, it adds a learned modality token $\mathbf{\breve{t}}^m$ to all projected tokens. Finally, a learned token $\mathbf{\dot{t}}^m_i$ is added to each token.
  • Figure 4: Impact of run-time sensor dropouts. To emulate robotics deployments, we vary the modality dropout probability at inference from 0% to 90% and report action recognition accuracy. "Baseline" uses the same architecture as our student "KARMMA$_\text{S}$" (see \ref{sec:teacher-and-student}) but is trained end-to-end with cross-entropy loss and without the KARMMA enhancements (see \ref{sec:karmma-enhancements}), whereas "Baseline w/ $\delta$" incorporates modality dropout and our missing modality strategy.
  • Figure 5: Impact of modality dropout rate during training. Each bar shows the top-1 action recognition accuracy of our teacher (see \ref{sec:teacher-and-student}) trained with a specific modality dropout rate and evaluated across different modality combinations. "V," "F," and "A" denote RGB video, optical flow, and audio, respectively.
  • ...and 1 more figure
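The modality dropout and missing modality strategy described in the captions of Figures 2 to 4 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `apply_modality_dropout` and `embed_modality`, the rule that at least one modality is always kept, and the use of a zero vector in place of the projected token when a modality is missing are all assumptions made for illustration. The captions only state that tokens are projected when available and that a learned modality token and a learned per-token vector are then added.

```python
import random


def apply_modality_dropout(sample, drop_prob, rng):
    """Drop each modality independently with probability drop_prob.

    sample maps a modality name (e.g. "V", "F", "A") to its input.
    Assumption: at least one modality is always kept.
    """
    kept = {m: x for m, x in sample.items() if rng.random() >= drop_prob}
    if not kept:
        # Everything was dropped: restore one modality at random.
        m = rng.choice(sorted(sample))
        kept[m] = sample[m]
    return kept


def embed_modality(tokens, modality_token, per_token_vectors, project):
    """Build the embedded token sequence for one modality.

    tokens: list of feature vectors, or None if the modality is missing.
    modality_token: learned vector shared by all tokens of this modality.
    per_token_vectors: one learned vector per token index i.
    project: hypothetical projection function applied to available tokens.
    """
    dim = len(modality_token)
    if tokens is None:
        # Assumption: a missing modality contributes only the learned tokens.
        projected = [[0.0] * dim for _ in per_token_vectors]
    else:
        projected = [project(t) for t in tokens]
    # Add the modality token, then the per-token vector, to every token.
    # zip stops at the shorter sequence, so lengths are expected to match.
    return [
        [p + mt + pt for p, mt, pt in zip(proj, modality_token, pos)]
        for proj, pos in zip(projected, per_token_vectors)
    ]
```

Sweeping `drop_prob` from 0.0 to 0.9 at inference mirrors the evaluation protocol described for Figure 4, while calling `embed_modality` with `tokens=None` shows how a student could still receive a well-formed token sequence for an absent sensor.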