Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

Thanh-Dat Truong, Khoa Luu

TL;DR

This work tackles cross-view action recognition by transferring knowledge from exocentric to egocentric videos, where egocentric data are scarce. It introduces CVAR, a Transformer-based framework that couples a geometry-informed cross-view constraint in self-attention with an unpaired cross-view self-attention loss, aligning video and attention distributions across views. Deep-feature distance and Jensen-Shannon divergence are employed as cross-view metrics, guided by a linear relation controlled by alpha and a bounded shift beta. Empirical results on Charades-Ego, EPIC-Kitchens-55/100, and NTU RGB+D demonstrate state-of-the-art performance and robustness to pairing settings and backbones, highlighting CVAR’s practical value for egocentric video understanding in low-data scenarios.

Abstract

Understanding action recognition in egocentric videos has emerged as a vital research topic with numerous practical applications. Because egocentric data collection is limited in scale, learning robust deep learning-based action recognition models remains difficult. Transferring knowledge learned from large-scale exocentric data to egocentric data is challenging because videos differ substantially across views. Our work introduces a novel cross-view learning approach to action recognition (CVAR) that effectively transfers knowledge from the exocentric to the egocentric view. First, we introduce a novel geometry-based constraint into the self-attention mechanism of the Transformer, derived from analyzing the camera positions of the two views. Then, we propose a new cross-view self-attention loss, learned on unpaired cross-view data, that encourages the self-attention mechanism to transfer knowledge across views. Finally, to further improve the performance of our cross-view learning approach, we present metrics that effectively measure the correlations in videos and attention maps. Experimental results on standard egocentric action recognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and EPIC-Kitchens-100, show our approach's effectiveness and state-of-the-art performance.
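The TL;DR names Jensen-Shannon divergence as the metric for comparing attention maps across views. The following is a minimal sketch of that divergence in plain Python, assuming each attention map has been flattened into a list of non-negative weights. The function names and the normalization step are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(a_exo, a_ego):
    """Jensen-Shannon divergence between two flattened attention maps.

    Assumption: both maps have the same number of entries. The mixture m
    is strictly positive wherever p or q is, so the KL terms never divide
    by zero.
    """
    if len(a_exo) != len(a_ego):
        raise ValueError("attention maps must have the same length")
    p, q = normalize(a_exo), normalize(a_ego)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The divergence is symmetric, equals zero for identical maps, and is bounded above by ln 2 when measured in nats, which makes it a convenient bounded distance between attention distributions.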

Paper Structure

This paper contains 11 sections, 13 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The Cross-view Self-attention Constraints. In the unpaired cross-view setting, the corresponding video and its attention map in the opposite view are inaccessible. Nevertheless, our cross-view self-attention loss is proven to impose the cross-view constraints via unpaired samples, based on the geometric properties between the two camera positions.
  • Figure 2: The Proposed Framework. The input videos $\mathbf{x}_{exo}$ and $\mathbf{x}_{ego}$ are first forwarded to the Transformer $F$, followed by the corresponding classifiers $C_{exo}$ and $C_{ego}$, respectively. Then, the supervised cross-entropy loss $\mathcal{L}_{ce}$ is applied to the predictions produced by the model. Meanwhile, the attention maps of the video inputs, i.e., $\mathbf{a}_{exo}$ and $\mathbf{a}_{ego}$, are extracted and constrained by the cross-view self-attention loss $\mathcal{L}_{self}$.
  • Figure 3: Effectiveness of Our Metrics in Cross-view Learning
  • Figure 4: Attention Visualization of Model Prediction on EPIC-Kitchens Videos.
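
The Figure 2 caption describes a training objective that combines a supervised cross-entropy loss on both views with the cross-view self-attention loss. The sketch below shows one way these terms might be combined for a single exocentric/egocentric sample pair. The plain sum and the weight `lam` are assumptions made for illustration, since the caption does not state how the terms are weighted, and `l_self` is passed in as an already computed value.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def total_loss(logits_exo, y_exo, logits_ego, y_ego, l_self, lam=1.0):
    """Combine the supervised and attention terms for one sample pair.

    logits_exo / logits_ego: outputs of the classifiers C_exo and C_ego.
    l_self: value of the cross-view self-attention loss (computed elsewhere).
    lam: hypothetical weight on l_self; not specified in the source.
    """
    ce = cross_entropy(logits_exo, y_exo) + cross_entropy(logits_ego, y_ego)
    return ce + lam * l_self
```

Setting `lam=0.0` reduces this to supervised training on the two views alone, which gives a simple baseline for judging the contribution of the attention term.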