DoGCLR: Dominance-Game Contrastive Learning Network for Skeleton-Based Action Recognition

Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai

TL;DR

DoGCLR introduces a Dominance-Game framework for skeleton-based action recognition, modeling positive and negative sample construction as a joint game to balance semantic preservation and discriminative power. It couples a Spatio-temporal Dual-Weight Localization (DW-KRM) with Dual-scale Game-based Augmentation (DGA) for positive samples and an Entropy-driven Dominance Game Replacement Queue (EDGRQ) for negative samples, incorporating region-aware augmentations and entropy-based memory management. The approach achieves state-of-the-art or competitive results on NTU RGB+D 60/120 and PKU-MMD benchmarks, demonstrating improved motion-region modeling, hard-negative diversity, and robust generalization across views and setups. This work advances self-supervised skeleton action learning by integrating game-theoretic optimization with region-aware augmentations and entropy-driven memory strategies, enabling stronger representations for downstream recognition tasks.

Abstract

Existing self-supervised contrastive learning methods for skeleton-based action recognition often process all skeleton regions uniformly and store negative samples in a first-in-first-out (FIFO) queue, which leads to motion information loss and suboptimal negative sample selection. To address these challenges, this paper proposes the Dominance-Game Contrastive Learning network for skeleton-based action Recognition (DoGCLR), a self-supervised framework based on game theory. DoGCLR models the construction of positive and negative samples as a dynamic Dominance Game, in which both sample types interact to reach an equilibrium that balances semantic preservation and discriminative strength. Specifically, a spatio-temporal dual-weight localization mechanism identifies key motion regions and guides region-wise augmentations to enhance motion diversity while maintaining semantics. In parallel, an entropy-driven dominance strategy manages the memory bank by retaining high-entropy (hard) negatives and replacing low-entropy (weak) ones, ensuring consistent exposure to informative contrastive signals. Extensive experiments are conducted on the NTU RGB+D and PKU-MMD datasets. On NTU RGB+D 60 X-Sub/X-View, DoGCLR achieves 81.1%/89.4% accuracy, and on NTU RGB+D 120 X-Sub/X-Set, it achieves 71.2%/75.5% accuracy, surpassing state-of-the-art methods by 0.1%, 2.7%, 1.1%, and 2.3%, respectively. On PKU-MMD Part I/Part II, DoGCLR performs comparably to state-of-the-art methods and achieves 1.9% higher accuracy on Part II, highlighting its robustness in more challenging scenarios.
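
To make the contrast with a FIFO queue concrete, the following is a minimal sketch of an entropy-driven replacement step. It is not the authors' EDGRQ implementation: the abstract does not say how entropy is computed, so the sketch assumes that each stored negative is scored by the Shannon entropy of a softmax over its similarities to the current batch of queries. The function names, the `temperature` value, and this scoring rule are all hypothetical.

```python
import math


def softmax_entropy(scores, temperature=0.07):
    """Shannon entropy of softmax(scores / temperature)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def entropy_replace(queue, queries, new_keys, temperature=0.07):
    """Overwrite the lowest-entropy negatives in `queue` with `new_keys`.

    ASSUMPTION: the entropy of a negative is taken over its similarities
    to the current queries; the source does not specify the scoring rule.
    Returns (updated_queue, replaced_indices).
    """
    entropies = [
        softmax_entropy([dot(neg, q) for q in queries], temperature)
        for neg in queue
    ]
    n_new = min(len(new_keys), len(queue))
    # Unlike FIFO, the slots to overwrite are chosen by score, not by age.
    weakest = sorted(range(len(queue)), key=lambda i: entropies[i])[:n_new]
    updated = [list(v) for v in queue]
    for slot, key in zip(weakest, new_keys):
        updated[slot] = list(key)
    return updated, weakest
```

Under this assumed scoring rule, a negative that is about equally similar to several queries has high entropy and is kept, while one that matches a single query sharply has low entropy and is overwritten first.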

Paper Structure

This paper contains 24 sections, 13 equations, 7 figures, 7 tables, 2 algorithms.

Figures (7)

  • Figure 1: Pipeline of DoGCLR. The positive sample optimization part consists of DW-KRM and DGA, which locate the key motion regions of samples and perform partitioned data augmentation to explore richer motion patterns while maintaining semantic consistency. The negative sample selection part is composed of EDGRQ, which improves the value of negative samples by retaining hard negatives. DW-KRM, DGA and EDGRQ together constitute the dual-dimensional dominance system of “positive sample optimization and negative sample selection”.
  • Figure 2: Block diagram of DW-KRM. The data and Global Statistical Benchmark Pose (GSBP) are encoded by the key encoder $f_k(\cdot)$ and MLP $g_k(\cdot)$ to compute their respective features. The Discrepancy-Degree (DD) is then fed into the Joint-Degree Activation Module (JDAM), where it is multiplied by the Joint-Degree (JD) calculated within JDAM to produce the spatio-temporal composite weight $\alpha_{c}^{(i)}$. This output from JDAM provides guidance for subsequent partitioned data augmentation. $\bigotimes$ denotes matrix multiplication.
  • Figure 3: Block diagram of DGA. DGA first preserves the content of nodes in $X^{(i)}$ with higher Joint-Degree (JD) — that is, the spatial expectation and temporal variance of $X^{(i)}$ — and transfers the style (spatial expectation and temporal variance) of another sample $X^{(j)}$ to the nodes in $X^{(i)}$ with lower JD. Subsequently, according to the key motion region mask $A_{tv}^{(i)}$ output by DW-KRM, strong augmentations are applied to the key motion regions, while normal augmentations are applied to the non-key regions, thereby generating the partitioned augmented sequence $X_{Final}^{(i)}$.
  • Figure 4: Confusion matrices of AimCLR and DoGCLR under the linear evaluation protocol on the NTU RGB+D 60 dataset, visualized for the Joint, Motion, and Bone streams. Each matrix is computed from 100 samples per class across 10 actions. DoGCLR consistently demonstrates higher recognition accuracy and reduced misclassification.
  • Figure 5: Comparison of the 3s-baseline methods and 3s-DoGCLR on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD Part I datasets under the linear evaluation protocol. 3s-DoGCLR consistently outperforms all baseline methods.
  • ...and 2 more figures
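
The captions for Figures 2 and 3 describe a two-step flow: a composite weight is formed from DD and JD, and the resulting key-region mask decides which joints receive strong or normal augmentation. The sketch below illustrates only that flow and is not the paper's DW-KRM or DGA. It assumes that DD and JD have already been reduced to one scalar per joint, that the composite weight is their element-wise product, that the mask is the top-k joints by weight, and that both augmentations are uniform coordinate jitter of different amplitudes. All function names and parameter values are hypothetical.

```python
import random


def composite_weights(dd, jd):
    """Per-joint composite weight. ASSUMPTION: element-wise product DD * JD."""
    return [d * j for d, j in zip(dd, jd)]


def key_region_mask(weights, top_k):
    """Mark the top_k highest-weight joints as key motion regions (assumed rule)."""
    ranked = sorted(range(len(weights)), key=lambda v: weights[v], reverse=True)
    key = set(ranked[:top_k])
    return [v in key for v in range(len(weights))]


def partitioned_jitter(sequence, mask, strong=0.10, normal=0.01, seed=0):
    """Jitter key joints with amplitude `strong`, all others with `normal`.

    `sequence` is a list of frames; each frame is a list of joints;
    each joint is a list of coordinates. The jitter stands in for the
    paper's strong/normal augmentations, which are not specified here.
    """
    rng = random.Random(seed)
    augmented = []
    for frame in sequence:
        new_frame = []
        for joint, is_key in zip(frame, mask):
            amplitude = strong if is_key else normal
            new_frame.append([c + rng.uniform(-amplitude, amplitude) for c in joint])
        augmented.append(new_frame)
    return augmented


# Usage with made-up scores for a three-joint skeleton over two frames.
dd = [0.9, 0.1, 0.5]
jd = [0.8, 0.9, 0.2]
mask = key_region_mask(composite_weights(dd, jd), top_k=1)
frames = [[[0.0, 0.0, 0.0] for _ in range(3)] for _ in range(2)]
augmented = partitioned_jitter(frames, mask)
```

In this toy input only the first joint is marked as a key region, so it is the only one perturbed at the larger amplitude.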

Theorems & Definitions (4)

  • Definition 1: Dominance Game, DG
  • Definition 2: Global Statistical Benchmark Pose, GSBP
  • Definition 3: Discrepancy-Degree, DD
  • Definition 4: Joint-Degree, JD