Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy

Hoang-Quan Nguyen, Thanh-Dat Truong, Khoa Luu

TL;DR

This work tackles robust action recognition by enforcing attention consistency across viewpoints. It introduces a Directed Gromov-Wasserstein discrepancy to compare attention volumes from different views and leverages a NeRF-inspired rendering pipeline to implicitly generate novel-view features during training on single-view data. The approach combines a Video Swin Transformer backbone with a NeRF module that renders low-resolution attention volumes, guided by a cosine-based directed GW loss to align attention across views. Empirical results on Jester, Something-Something V2, and Kinetics-400 demonstrate state-of-the-art or competitive performance, with notable gains from novel-view training and DGW-based consistency. The method offers a principled, explainable way to ensure the model attends to the correct action subject, potentially improving reliability in real-world video understanding tasks.

Abstract

Action recognition has become one of the most popular research topics in computer vision. Various methods based on Convolutional Networks and on self-attention mechanisms such as Transformers address both the spatial and temporal dimensions of action recognition and achieve competitive performance. However, these methods offer no guarantee that the model attends to the correct action subject, i.e., that an action recognition model focuses on the proper action subject when making its prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between the attentions from two different views of an action video using the Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets. The contributions of this work are three-fold. Firstly, we introduce multi-view attention consistency to address the problem of reasonable prediction in action recognition. Secondly, we define a new metric for multi-view consistent attention using the Directed Gromov-Wasserstein Discrepancy. Thirdly, we build an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.
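
To make the idea of comparing attention structure across two views more concrete, here is a minimal sketch of the standard (undirected) Gromov-Wasserstein objective, evaluated for a fixed coupling between the tokens of two views. This is not the paper's Directed Gromov-Wasserstein Discrepancy, whose definition is not given in this summary. The helper names (`cosine_distance`, `pairwise`, `gw_objective`), the toy feature vectors, and the use of cosine distance as the intra-view structure are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise(features):
    """Intra-view structure matrix: cosine distance between every token pair."""
    return [[cosine_distance(u, v) for v in features] for u in features]


def gw_objective(c1, c2, coupling):
    """Squared-loss GW objective for a fixed coupling (no optimization).

    Computes sum_{i,j,k,l} (c1[i][k] - c2[j][l])^2 * coupling[i][j] * coupling[k][l].
    """
    n, m = len(c1), len(c2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue  # skip pairs that carry no mass
            for k in range(n):
                for l in range(m):
                    diff = c1[i][k] - c2[j][l]
                    total += diff * diff * coupling[i][j] * coupling[k][l]
    return total


# Hypothetical token features from two views. View B has the same directions
# as view A at different scales, so its cosine structure is identical.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[2.0, 0.0], [0.0, 3.0], [4.0, 4.0]]

third = 1.0 / 3.0
# Coupling that matches token i in view A with token i in view B.
aligned_coupling = [[third if i == j else 0.0 for j in range(3)] for i in range(3)]
# Coupling that swaps tokens 0 and 2, breaking the correspondence.
swapped_coupling = [[0.0, 0.0, third], [0.0, third, 0.0], [third, 0.0, 0.0]]

c_a, c_b = pairwise(view_a), pairwise(view_b)
aligned = gw_objective(c_a, c_b, aligned_coupling)
mismatched = gw_objective(c_a, c_b, swapped_coupling)
print(f"aligned={aligned:.6f} mismatched={mismatched:.6f}")
```

The objective is close to zero when matched tokens share the same relational structure in both views, and grows when the coupling pairs tokens whose structures disagree. A full Gromov-Wasserstein distance would additionally minimize over all valid couplings, which this sketch deliberately omits.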

Paper Structure (16 sections, 15 equations, 4 figures, 3 tables)

This paper contains 16 sections, 15 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The motivation of our method. Given an action subject with two different camera views, this work aims to ensure that the attention of the model is consistent.
  • Figure 2: The proposed action classification model. The input video is decomposed into patches, which are then embedded. The embedded patches are processed by Transformer blocks. The resulting representation vectors are mapped to the weights of the MLP in the feature renderer. Given the querying rays, the module renders the feature vectors used by the last Transformer block before classification.
  • Figure 3: The attention visualization of four frames (a) from two camera angles $\beta=-10^{\circ}$ and $\beta=10^{\circ}$ on the three settings: (b) without $\mathcal{L}_{\text{GW}}$ and $\mathcal{L}_{\text{DGW}}$, (c) with $\mathcal{L}_{\text{GW}}$, and (d) with $\mathcal{L}_{\text{DGW}}$.
  • Figure 4: Samples from three large-scale action recognition datasets: (a) Jester [materzynska2019jester], (b) Something-Something V2 [goyal2017something], and (c) Kinetics-400 [kay2017kinetics].