Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection

Wonseon Lim, Hyejeong Im, Dae-Won Kim

Abstract

Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.

Paper Structure (14 sections, 8 equations, 5 figures, 4 tables, 2 algorithms)

Figures (5)

  • Figure 1: Task-wise average AUC for novel activity detection in the mid-sequence setting. ER (All), which uses all modalities, closely tracks the RGB-only baseline, indicating that its main logits are largely dominated by RGB. In contrast, MAND maintains consistently higher performance by better exploiting modality-specific evidence.
  • Figure 2: Overview of our modality-aware framework for MMEA OWCL. Left (MoAS, inference): per-sample modality reliabilities are estimated from normalized energy scores and used to adaptively integrate weighted modality logits with the main logits, improving separability between known and novel activities. Right (MoRST, training): modality-specific heads and replay-based modality-logit distillation preserve per-modality decision boundaries across tasks. MoRST stabilizes modality-specific evidence over time, which MoAS then leverages to produce better-calibrated novelty scores.
  • Figure 3: Task-wise novel activity detection performance. Average AUC over the task stream for the best-performing method pair in each class-incremental setting.
  • Figure 4: Task-wise known-class classification performance. Average accuracy over the task stream for continual learning methods under each class-incremental setting.
  • Figure 5: Effect of different scoring strategies on novelty score separability for Task 2 in the short-sequence (8-class incremental) setting.
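
The MoAS idea summarized in the abstract and in the Figure 2 caption (per-sample modality reliabilities from normalized energy scores, used to weight modality logits before combining them with the main logits) can be illustrated with a minimal NumPy sketch. Only the energy score itself follows the standard log-sum-exp definition. The softmax normalization over negative energies, the additive fusion rule, the `temperature` parameter, and all function names are assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def energy(logits, temperature=1.0):
    """Free-energy score: lower values mean more confident, known-like logits."""
    return -temperature * logsumexp(logits / temperature, axis=-1)


def modality_weights(modality_logits, temperature=1.0):
    """Per-sample reliability weights (ASSUMED form: softmax over negative energies).

    modality_logits: dict mapping a modality name to an array of shape (batch, classes).
    Returns the modality names and a (batch, num_modalities) weight matrix.
    """
    names = list(modality_logits)
    energies = np.stack(
        [energy(modality_logits[n], temperature) for n in names], axis=-1
    )
    neg = -energies
    neg = neg - neg.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(neg)
    weights /= weights.sum(axis=-1, keepdims=True)
    return names, weights


def fused_novelty_score(main_logits, modality_logits, temperature=1.0):
    """Add reliability-weighted modality logits to the main logits (ASSUMED fusion),
    then score novelty by the energy of the result (higher means more novel)."""
    names, weights = modality_weights(modality_logits, temperature)
    fused = np.array(main_logits, dtype=float)
    for i, name in enumerate(names):
        fused = fused + weights[:, i:i + 1] * modality_logits[name]
    return energy(fused, temperature), weights
```

In this sketch, a modality whose logits are sharply peaked for a given sample receives a larger weight for that sample, so a confident IMU head can contribute even when the main logits are nearly flat. Thresholding the returned score would then separate known from novel samples; the threshold choice is outside the scope of this sketch.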