View-aware Cross-modal Distillation for Multi-view Action Recognition

Trung Thanh Nguyen; Yasutomo Kawanishi; Vijay John; Takahiro Komamizu; Ichiro Ide

TL;DR

This work tackles multi-view action recognition in partially overlapping sensor setups where some modalities and dense frame-level labels may be unavailable. It introduces ViCoKD, a knowledge-distillation framework that transfers knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student using a cross-modal adapter with cross-modal attention and a view-aware consistency loss based on human-detection masks and confidence-weighted Jensen-Shannon divergence. On the MultiSensor-Home dataset, the approach consistently outperforms competitive distillation methods across backbones and environments and surpasses the teacher under limited conditions, that is, with sequence-level supervision and partial view overlap. These findings indicate the practical value of explicitly modeling view-aware consistency for robust real-world multi-view recognition systems.

Abstract

The widespread use of multi-sensor systems has spurred research on multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings, where actions are visible in only a subset of views, remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. The module enforces prediction alignment between views only when the action is co-visible in them; co-visibility is determined from human-detection masks, and alignment is measured by a confidence-weighted Jensen-Shannon divergence between the views' predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
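To make the view-aware consistency idea concrete, the following is a minimal sketch of a masked, confidence-weighted Jensen-Shannon loss between two views. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formulation, so the function names (`js_divergence`, `view_consistency_loss`), the per-frame boolean visibility flags standing in for human-detection masks, and the choice of confidence weight (the product of each view's maximum class probability) are all hypothetical.

```python
import math


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two class distributions."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i ~ 0 contribute nothing (0 * log 0 = 0 convention).
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > eps)

    return 0.5 * (kl(p, m) + kl(q, m))


def view_consistency_loss(probs_a, probs_b, visible_a, visible_b):
    """Confidence-weighted JS divergence averaged over co-visible frames.

    probs_a, probs_b: per-frame class distributions from two views.
    visible_a, visible_b: per-frame booleans, True if a person is detected
        in that view (a stand-in for the human-detection masks).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for p, q, va, vb in zip(probs_a, probs_b, visible_a, visible_b):
        if not (va and vb):
            continue  # skip frames where the action is not co-visible
        # Assumed confidence weight: product of the two views' top probabilities.
        w = max(p) * max(q)
        weighted_sum += w * js_divergence(p, q)
        weight_total += w
    # No co-visible frames means there is nothing to align.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two views, two frames, two classes; the views disagree on the first frame.
view_a = [[0.9, 0.1], [0.2, 0.8]]
view_b = [[0.1, 0.9], [0.2, 0.8]]
loss = view_consistency_loss(view_a, view_b, [True, True], [True, True])
```

In this sketch, frames where either view lacks a detection are excluded entirely, so disagreement between a view that sees the action and one that does not is never penalized. Low-confidence frame pairs contribute less to the average than confident ones.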
