Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy
Hoang-Quan Nguyen, Thanh-Dat Truong, Khoa Luu
TL;DR
This work tackles robust action recognition by enforcing attention consistency across viewpoints. It introduces a Directed Gromov-Wasserstein discrepancy to compare attention volumes from different views and leverages a NeRF-inspired rendering pipeline to implicitly generate novel-view features during training on single-view data. The approach combines a Video Swin Transformer backbone with a NeRF module that renders low-resolution attention volumes, guided by a cosine-based directed GW loss to align attention across views. Empirical results on Jester, Something-Something V2, and Kinetics-400 demonstrate state-of-the-art or competitive performance, with notable gains from novel-view training and DGW-based consistency. The method offers a principled, explainable way to ensure the model attends to the correct action subject, potentially improving reliability in real-world video understanding tasks.
Abstract
Action recognition has become a popular research topic in computer vision. Methods based on convolutional networks and on self-attention mechanisms such as Transformers address both the spatial and temporal dimensions of action recognition and achieve competitive performance. However, these methods offer no guarantee that the model attends to the correct action subject, i.e., that an action recognition model focuses on the proper subject when making its prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between the attention from two different views of an action video using the Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets. The contributions of this work are three-fold. First, we introduce multi-view attention consistency to address the problem of reasonable prediction in action recognition. Second, we define a new metric for multi-view consistent attention using the Directed Gromov-Wasserstein Discrepancy. Third, we build an action recognition model based on Video Transformers and Neural Radiance Fields. Compared with recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets: Jester, Something-Something V2, and Kinetics-400.
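As a rough illustration of the kind of quantity a Gromov-Wasserstein discrepancy measures, the sketch below evaluates a standard squared-loss Gromov-Wasserstein objective between two sets of attention features for a fixed coupling, using cosine distance to build each view's internal structure matrix. It is not the paper's directed formulation, which this abstract does not specify. The function names `cosine_structure` and `gw_objective`, the squared loss, and the fixed coupling are all assumptions made for illustration; a full discrepancy would also minimize over couplings.

```python
import numpy as np


def cosine_structure(features):
    """Pairwise cosine distances within one view (hypothetical helper).

    features: array of shape (n, d), one row per attention location.
    Returns an (n, n) matrix with entries 1 - cos(x_i, x_k).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # guard against zero rows
    return 1.0 - unit @ unit.T


def gw_objective(c1, c2, coupling):
    """Squared-loss Gromov-Wasserstein objective for a FIXED coupling.

    Computes sum_{i,j,k,l} (c1[i,k] - c2[j,l])^2 * T[i,j] * T[k,l]
    by expanding the square, which avoids building a 4-D tensor.
    This is an illustrative assumption, not the paper's directed variant.
    """
    p = coupling.sum(axis=1)  # marginal over locations of view 1
    q = coupling.sum(axis=0)  # marginal over locations of view 2
    term1 = p @ (c1 ** 2) @ p
    term2 = q @ (c2 ** 2) @ q
    cross = np.sum(coupling * (c1 @ coupling @ c2.T))
    return term1 + term2 - 2.0 * cross


# Usage: two "views" whose features differ only by a rotation.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(5, 3))
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # orthogonal matrix
view_b = view_a @ rotation

identity_coupling = np.eye(5) / 5.0  # match location i to location i
cost = gw_objective(
    cosine_structure(view_a), cosine_structure(view_b), identity_coupling
)
```

Because only the within-view structure matrices are compared, the two views do not need to share a feature space. In this toy case the rotation leaves all pairwise cosine distances unchanged, so `cost` is zero up to floating-point error.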
