COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim

TL;DR

COMODO addresses the gap between video-based HAR models, which are accurate but power-hungry and privacy-invasive, and IMU-based HAR, which is energy-efficient and privacy-preserving but limited by scarce annotated data. It transfers semantic knowledge from a frozen, pretrained video encoder to a trainable IMU encoder via cross-modal self-supervised distillation, using a dynamic FIFO instance queue to align the similarity distributions of video and IMU embeddings without labeled data. On Ego4D, EgoExo4D, and MMEA, the distilled IMU encoder often matches or exceeds fully supervised baselines and generalizes well across datasets. Because the method is simple and model-agnostic, it can be paired with other pretrained video and time-series models, which makes it a practical route to energy-efficient, privacy-preserving on-device HAR for wearables.

Abstract

Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at https://github.com/Breezelled/COMODO .
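
The dynamic instance queue mentioned in the abstract can be pictured as a fixed-capacity first-in, first-out buffer of video embeddings. The sketch below is only an illustration under assumptions: the class name `InstanceQueue`, its methods, and the toy capacity are hypothetical and are not taken from the paper or its repository.

```python
from collections import deque


class InstanceQueue:
    """Hypothetical fixed-capacity FIFO store of teacher (video) embeddings."""

    def __init__(self, capacity):
        # A deque with maxlen evicts the oldest entry automatically once full.
        self._items = deque(maxlen=capacity)

    def enqueue(self, batch):
        # Append every embedding of the batch; the oldest entries leave first.
        for embedding in batch:
            self._items.append(list(embedding))

    def snapshot(self):
        # Return the current contents, ordered from oldest to newest.
        return list(self._items)

    def __len__(self):
        return len(self._items)


# Toy usage: capacity 3, two batches of two 2-D embeddings each.
queue = InstanceQueue(capacity=3)
queue.enqueue([[1.0, 0.0], [0.0, 1.0]])
queue.enqueue([[0.5, 0.5], [0.2, 0.8]])
```

After the second batch the queue holds the three most recent embeddings, and the first embedding of the first batch has been evicted.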

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Motivation: Egocentric videos provide rich semantic information but are impractical for continuous on-device recognition, while IMU sensors are lightweight and energy-efficient yet lack large-scale training data. To bridge this gap, we propose cross-modal, self-supervised distillation to enhance IMU representations by leveraging video knowledge.
  • Figure 2: Overview of our cross-modal self-supervised distillation framework. The video encoder is pretrained and kept frozen, while the IMU encoder, initialized from a pretrained time-series model, is trained by minimizing the cross-entropy loss between the similarity distributions of video and IMU embeddings, which are computed based on a continuously updated instance queue.
  • Figure 3: Impact of queue size on accuracy across datasets.
  • Figure 4: Accuracy of distillation methods across datasets.
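
The objective described in the Figure 2 caption, a cross-entropy between the similarity distributions of video and IMU embeddings over a shared instance queue, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the use of cosine similarity, the softmax temperatures `tau_teacher` and `tau_student`, and all function names are assumptions made for the example.

```python
import math


def l2_normalize(vector):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores, temperature):
    # Numerically stable softmax over temperature-scaled scores.
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(embedding, queue, temperature):
    # Cosine similarity of one embedding to every queued embedding,
    # turned into a probability distribution over the queue.
    unit = l2_normalize(embedding)
    sims = [sum(a * b for a, b in zip(unit, l2_normalize(q))) for q in queue]
    return softmax(sims, temperature)


def distillation_loss(video_emb, imu_emb, queue,
                      tau_teacher=0.05, tau_student=0.1):
    # Temperatures are assumed values chosen for illustration only.
    target = similarity_distribution(video_emb, queue, tau_teacher)  # frozen teacher
    pred = similarity_distribution(imu_emb, queue, tau_student)      # trainable student
    # Cross-entropy H(target, pred) between the two distributions.
    return -sum(t * math.log(p) for t, p in zip(target, pred))


# Toy example: a queue of three 2-D video embeddings.
toy_queue = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [1.0, 0.1]
loss_aligned = distillation_loss(video, [1.0, 0.1], toy_queue)
loss_misaligned = distillation_loss(video, [0.1, 1.0], toy_queue)
```

In this toy setup, an IMU embedding that points in the same direction as the video embedding yields a lower loss than one pointing elsewhere, which is the behavior the distillation objective is meant to reward. In actual training the loss would be averaged over a batch and minimized with respect to the IMU encoder's parameters only.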