Trunk-branch Contrastive Network with Multi-view Deformable Aggregation for Multi-view Action Recognition

Yingyuan Yang, Guoyuan Liang, Can Wang, Xiaojun Wu

TL;DR

This paper tackles RGB-based multi-view action recognition by proposing TBCNet, which first fuses multi-view features in a trunk block and then enriches the global representation with view-specific details through trunk-branch contrastive learning. The trunk utilizes Multi-view Deformable Aggregation (MVDA), composed of 3D Deformable Sampling, a Global Aggregation Module, and a Composite Relative Position Bias to model cross-view spatio-temporal correlations. A weighted trunk-branch contrastive loss aligns aggregated features with detailed per-view cues, emphasizing hard samples and subtle inter-class differences, and a two-stage training scheme preserves initial discriminative power while enabling cross-view enhancement. Evaluations on NTU-RGB+D 60/120, PKU-MMD, and N-UCLA show state-of-the-art RGB-based performance under several protocols, with the branch block removable at inference to reduce computation. Collectively, the approach advances robust cross-view action understanding by integrating global fusion with fine-grained view details and offers practical efficiency gains for real-world multi-view systems.

Abstract

Multi-view action recognition aims to identify actions in a given multi-view scene. Traditional methods first extract refined features from each view and then perform pairwise interaction and integration, but they can overlook critical local features in each view. When observing objects from multiple perspectives, people typically form an overall impression and then fill in specific details. Drawing inspiration from this cognitive process, we propose a novel trunk-branch contrastive network (TBCNet) for RGB-based multi-view action recognition. Unlike these methods, TBCNet first obtains fused features in the trunk block and then implicitly supplements them with vital details provided by the branch block via contrastive learning, generating a more informative and comprehensive action representation. Within this framework, we construct two core components: multi-view deformable aggregation (MVDA) and trunk-branch contrastive learning. MVDA, employed in the trunk block, facilitates multi-view feature fusion and adaptive cross-view spatio-temporal correlation; within it, a global aggregation module emphasizes significant spatial information and a composite relative position bias captures intra- and cross-view relative positions. Moreover, a trunk-branch contrastive loss is constructed between the aggregated features and the refined details from each view. By incorporating two distinct weights for positive and negative samples, a weighted trunk-branch contrastive loss is proposed to extract valuable information and emphasize subtle inter-class differences. The effectiveness of TBCNet is verified by extensive experiments on four datasets: NTU-RGB+D 60, NTU-RGB+D 120, PKU-MMD, and N-UCLA. Compared to other RGB-based methods, our approach achieves state-of-the-art performance in cross-subject and cross-setting protocols.
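
To make the idea of a weighted contrastive loss between a fused feature and per-view details more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes an InfoNCE-style objective with cosine similarity, a temperature `tau`, and per-sample weights `w_pos` and `w_neg`. The function names, the weighting scheme, and the way weights enter the loss are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def weighted_contrastive_loss(anchor, positives, negatives,
                              w_pos, w_neg, tau=0.1):
    """Illustrative weighted InfoNCE-style loss (an assumption, not the
    paper's equation).

    anchor    : fused (trunk) feature of one sample
    positives : branch features sharing the anchor's class
    negatives : branch features from other classes
    w_pos     : one weight per positive; scales that positive's loss term
    w_neg     : one weight per negative; scales its share of the denominator
    """
    pos_logits = [cosine(anchor, p) / tau for p in positives]
    neg_logits = [cosine(anchor, n) / tau for n in negatives]

    # Weighted mass of negatives: a larger weight makes a negative "harder".
    neg_mass = sum(w * math.exp(l) for w, l in zip(w_neg, neg_logits))

    total = 0.0
    for w, l in zip(w_pos, pos_logits):
        e = math.exp(l)
        total += -w * math.log(e / (e + neg_mass))
    return total / sum(w_pos)


if __name__ == "__main__":
    fused = [1.0, 0.0]          # toy fused feature
    same_class = [[0.9, 0.1]]   # toy branch feature, same class
    other_class = [[0.1, 0.9]]  # toy branch feature, different class
    print(weighted_contrastive_loss(fused, same_class, other_class,
                                    w_pos=[1.0], w_neg=[1.0]))
```

In this sketch, raising a negative's weight increases its contribution to the denominator, so negatives that are easy to confuse with the anchor's class are pushed away more strongly. That is one plausible way to emphasize subtle inter-class differences. How the paper derives its two weights is not stated in the abstract.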

Paper Structure

This paper contains 32 sections, 11 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The architecture of different methods, including traditional multi-view methods, the method using disentangled representation learning and the proposed TBCNet.
  • Figure 2: Schematic of the key components of this work. (a) Multi-view deformable aggregation. For a given query, the module attends to important deformed points in the entire multi-view sequence. (b) Trunk-branch contrastive learning. The fused feature (the yellow circle) employs contrastive learning to extract discriminative details from positive (green circles) and negative samples (blue circles).
  • Figure 3: Outline of the proposed trunk-branch contrastive network. The trunk block employs multi-view deformable aggregation (MVDA) for global feature fusion. Trunk-branch contrastive learning helps the fused feature $\mathbf{Z}^G$ absorb the valid detailed information $\mathbf{Z}^B_v (v \in [1,V])$ from the branch block, yielding a comprehensive representation $\hat{\mathbf{Z}}^G$. The classifier in the branch block assigns effective weights $w_n,w_p$ to contrastive samples, emphasizing subtle differences between samples.
  • Figure 4: The illustration of the proposed multi-view deformable aggregation.
  • Figure 5: Illustration of the trunk-branch contrastive learning. The green area bounded by the yellow circle represents the feature space of the anchor's class.
  • ...and 5 more figures