View-aware Cross-modal Distillation for Multi-view Action Recognition

Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide

TL;DR

This work tackles multi-view action recognition in partially overlapping sensor setups where some modalities and dense frame-level labels may be unavailable. It introduces ViCoKD, a knowledge-distillation framework that transfers knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student using a cross-modal adapter with cross-modal attention and a view-aware consistency loss based on human-detection masks and confidence-weighted Jensen–Shannon divergence. On the MultiSensor-Home dataset, the approach consistently outperforms competitive distillation methods across backbones and environments and, under limited conditions, surpasses the teacher, with notable gains under sequence-level supervision and partial view overlap. These findings indicate the practical value of explicitly modeling view-aware consistency for robust real-world multi-view recognition systems.

Abstract

The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
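
The view-aware consistency idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a per-view boolean visibility flag in place of human-detection masks, and the choice of the product of maximum class probabilities as the confidence weight are all assumptions made for illustration.

```python
import math
from itertools import combinations


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def view_consistency_loss(view_logits, human_visible):
    """Confidence-weighted JS divergence over co-visible view pairs.

    view_logits:   one list of class logits per view.
    human_visible: one bool per view (hypothetical stand-in for a
                   human-detection mask).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for i, j in combinations(range(len(view_logits)), 2):
        # Only align predictions when the person is seen in both views.
        if not (human_visible[i] and human_visible[j]):
            continue
        p, q = softmax(view_logits[i]), softmax(view_logits[j])
        # Assumed confidence weight: product of the two peak probabilities.
        weight = max(p) * max(q)
        weighted_sum += weight * js_divergence(p, q)
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Views 0 and 1 disagree and are co-visible; view 2 sees no person.
loss = view_consistency_loss(
    [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]],
    [True, True, False],
)
```

In this sketch, pairs in which either view lacks a detected person contribute nothing, so a view that cannot see the action is not forced to agree with one that can.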

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the proposed ViCoKD method: (a) a multi-modal multi-view teacher with cross-modal attention, and (b) a knowledge distillation pipeline in which the student is trained using feature-level and logit-level distillation under view-aware consistency supervision.
  • Figure 2: Room layouts and sensor views for the MultiSensor-Home dataset [nguyen2025multisensor] used in the experiments. Each home environment is equipped with multiple RGB and audio sensors, capturing scenes from different viewpoints with partial overlaps.
  • Figure 3: mAP [%] curves on the test set using the MultiASL [nguyen2024action] backbone under different distillation settings.
  • Figure 4: Qualitative comparison of attention maps for the teacher, the baseline student, and the proposed ViCoKD method using the MultiASL [nguyen2024action] backbone on the MultiSensor-Home dataset [nguyen2025multisensor]. Each row corresponds to a different sensor view. ViCoKD produces more precise and human-centric attention (yellow boxes) than the baseline student.