Video Domain Incremental Learning for Human Action Recognition in Home Environments

Yuanda Hu, Xing Liu, Meiying Li, Yate Ge, Xiaohua Sun, Weiwei Guo

TL;DR

This paper formalizes Video Domain Incremental Learning (VDIL) for daily human action recognition in dynamic home environments, where the data distribution evolves but the action set remains fixed. It introduces a VDIL benchmark with three domain splits (user, scene, hybrid) drawn from NTU RGB+D, Toyota Smarthome, and ETRI-Activity3D-LivingLab, and proposes DRIFT, a replay-based baseline that uses reservoir sampling and a dual loss to balance learning new domains against retaining past knowledge. The optimization is framed as $\min_\theta \sum_{k=1}^T \mathcal{L}_k$, with $\mathcal{L}_k = \mathbb{E}_{(V_{k,j}, y_{k,j}) \sim \mathcal{D}_k, \mathcal{B}} [\ell(y_{k,j}, f_\theta(V_{k,j}))]$, where $\mathcal{D}_k$ is the data of the $k$-th domain and $\mathcal{B}$ is the replay memory. The training loss adds a knowledge-distillation term $\mathcal{L}_{KD}$ to the classification loss: $\mathcal{L} = \mathcal{L}_{class} + \lambda \mathcal{L}_{KD}$. Across the three benchmarks, DRIFT and other replay-based methods outperform regularization baselines, and memory-constrained settings still approach the Joint upper bound, indicating robustness for home-action recognition under continual domain shifts. The work lays a foundation for practical, on-device VDIL systems and points to future directions such as few-shot VDIL to address label scarcity in real-world deployments.
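To make the combined objective $\mathcal{L} = \mathcal{L}_{class} + \lambda \mathcal{L}_{KD}$ concrete, the following is a minimal pure-Python sketch for a single pair of samples. The summary above does not specify the exact form of $\mathcal{L}_{KD}$, so this sketch assumes a temperature-softened KL divergence between logits stored in memory and the current model's logits on the same replayed clip. The function names, the temperature, and the default value of `lam` are illustrative assumptions, not the authors' settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Classification loss for one sample: -log p(label)."""
    return -math.log(softmax(logits)[label])

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed KD term: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dual_loss(new_logits, new_label, replay_logits, stored_logits, lam=0.5):
    """L = L_class + lam * L_KD (lam is a hypothetical default)."""
    class_term = cross_entropy(new_logits, new_label)
    kd_term = kd_loss(replay_logits, stored_logits)
    return class_term + lam * kd_term

# Usage: one new-domain sample and one replayed sample with stored logits.
loss = dual_loss(
    new_logits=[2.0, 0.5, -1.0], new_label=0,
    replay_logits=[0.2, 1.5, 0.1], stored_logits=[0.0, 2.0, 0.0],
)
```

Setting `lam` to zero reduces the objective to plain fine-tuning on the classification loss, which is the regime where catastrophic forgetting is expected; larger values weight retention of earlier predictions more heavily.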

Abstract

Recognizing daily human actions in homes is challenging because unconstrained home environments are diverse and change dynamically, which creates a need to adapt continually to different users and scenes. Fine-tuning current video understanding models on newly encountered domains often leads to catastrophic forgetting, where the models lose their ability to perform well on previously learned scenarios. To address this issue, we formalize the problem of Video Domain Incremental Learning (VDIL), in which models learn continually from different domains while the set of action classes stays fixed. Existing continual learning research primarily focuses on class-incremental learning, while domain-incremental learning has been largely overlooked in video understanding. In this work, we introduce a novel benchmark for domain-incremental human action recognition in unconstrained home environments. We design three domain split types (user, scene, hybrid) to systematically assess the challenges posed by domain shifts in real-world home settings. Furthermore, we propose a baseline learning strategy based on replay and reservoir sampling that requires no domain labels, so that it can handle limited memory and task-agnostic settings. Extensive experimental results demonstrate that our simple sampling and replay strategy outperforms most existing continual learning methods across the three proposed benchmarks.
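The reservoir sampling idea mentioned in the abstract can be sketched as follows. This is the standard reservoir sampling algorithm (Algorithm R) applied to a fixed-size replay memory, not the authors' implementation; the class name `ReservoirBuffer` and its methods are hypothetical. The point it illustrates is that the update needs only a running count of samples seen, so no domain label or domain boundary is required.

```python
import random

class ReservoirBuffer:
    """Fixed-size replay memory filled by reservoir sampling (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []   # stored samples, e.g. (clip, label) pairs
        self.seen = 0     # total number of samples observed so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one incoming sample; no domain identifier is needed."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Memory not full yet: always keep the sample.
            self.items.append(sample)
        else:
            # Keep the new sample with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def sample(self, batch_size):
        """Draw a replay batch without replacement."""
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)

# Usage: stream 1000 sample ids through a memory of 50 slots.
buffer = ReservoirBuffer(capacity=50, seed=0)
for sample_id in range(1000):
    buffer.add(sample_id)
replay_batch = buffer.sample(8)
```

After any number of `add` calls, every sample seen so far has the same probability of being in memory, so domains are represented roughly in proportion to how many samples they contributed to the stream.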

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Comparison of Video Class Incremental Learning (VCIL) and Video Domain Incremental Learning (VDIL). VCIL learns new classes incrementally with a fixed data distribution, while VDIL learns from evolving data distributions within a constant class set. (b) The bar plots show catastrophic forgetting in the VDIL paradigm, indicated by the performance degradation on previous tasks when unseen domains are learned sequentially.
  • Figure 2: Illustration of the Video Domain Incremental Learning (VDIL) benchmarks, where models learn from evolving data distributions while the action classes remain fixed. We split three datasets (NTU RGB+D, Toyota Smarthome, and ETRI-Activity3D-LivingLab) into different domain types: user, scene, and hybrid (combining user and scene), to create benchmarks that evaluate models' ability to incrementally adapt to domain shifts commonly encountered in home scenarios.
  • Figure 3: t-SNE visualization of feature distributions in the Toyota Smarthome dataset. Colors represent distinct Scene Domains, each corresponding to a different camera viewpoint. Contrary to expectations, samples cluster by domain rather than action class.
  • Figure 4: DRIFT leverages reservoir sampling to address the task-agnostic challenge, ensuring that each domain is equally likely to be represented in the memory. Furthermore, DRIFT incorporates a dual-loss strategy to maintain robust adaptability and memory retention across varied domains.
  • Figure 5: Performance comparison of continual learning methods across three proposed benchmarks. Panels (a), (b), and (c) illustrate the performance dynamics and final average accuracies of various algorithms on the NTU RGB+D, Toyota Smarthome, and ETRI-Activity3D-LivingLab datasets, respectively.