M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition

Yanshan Li, Ke Ma, Miaomiao Wei, Linhui Dai

TL;DR

This work establishes the Infinite Skeleton-data Game (ISG) model and rigorously proves the ISG equilibrium theorem, enabling mini-max optimization based on multi-view mutual information, and introduces a dual-loss equilibrium optimizer to optimize the game equilibrium.

Abstract

In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive experiments show that, in the three-stream setting, M3GCLR achieves 82.1% and 85.8% accuracy on NTU RGB+D 60 (X-Sub, X-View) and 72.3% and 75.0% accuracy on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Part I and Part II, it attains 89.1% and 45.2%, respectively, in the three-stream setting. All results match or outperform state-of-the-art performance. Ablation studies confirm the effectiveness of each component.

Paper Structure

This paper contains 20 sections, 3 theorems, 35 equations, 10 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

If the polynomial function of mutual information functions serves as the utility function, and ISG $\Gamma_S = (E,(\boldsymbol{\uptheta}_i)_{i \in E}, (u_i(\boldsymbol{\uptheta}))_{i \in E})$ is defined on a bounded and closed set $\boldsymbol{\Theta}_i$, then the equilibrium of ISG exists.
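
As a hedged illustration of the theorem's hypothesis, consider the two-encoder utilities described in the Figure 4 caption, read here as $u_2 = I_1 + (I_1 - I_2)^2$ and $u_1 = -u_2$; both are polynomials in the mutual-information terms $I_1$ and $I_2$. Writing $\boldsymbol{\uptheta}_1$ and $\boldsymbol{\uptheta}_2$ for the parameters of query encoders 1 and 2 (this two-player indexing is an assumption made for the sketch, not notation confirmed by the source), an equilibrium $(\boldsymbol{\uptheta}_1^{*}, \boldsymbol{\uptheta}_2^{*})$ of such a zero-sum game is a saddle point of $u_2$:

$$
u_2(\boldsymbol{\uptheta}_1^{*}, \boldsymbol{\uptheta}_2) \;\le\; u_2(\boldsymbol{\uptheta}_1^{*}, \boldsymbol{\uptheta}_2^{*}) \;\le\; u_2(\boldsymbol{\uptheta}_1, \boldsymbol{\uptheta}_2^{*})
\quad \text{for all } \boldsymbol{\uptheta}_1 \in \boldsymbol{\Theta}_1,\ \boldsymbol{\uptheta}_2 \in \boldsymbol{\Theta}_2 .
$$

When such a saddle point exists, the order of optimization does not matter, i.e. $\min_{\boldsymbol{\uptheta}_1} \max_{\boldsymbol{\uptheta}_2} u_2 = \max_{\boldsymbol{\uptheta}_2} \min_{\boldsymbol{\uptheta}_1} u_2$, which is the sense in which the equilibrium supports mini-max optimization.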

Figures (10)

  • Figure 1: Pipeline of the proposed M3GCLR. The input sequence $\mathbf{X}^{(i)}$ is first processed by the Multi-view Rotation-based Augmentation Module, where the normal-augmentation Rotation Matrix, extreme-augmentation Rotation Matrix, and batch averaging are applied to generate the normally augmented data, the extremely augmented data, and the average data, respectively. These three views are fed into query encoder 1, query encoder 2, and the key encoder to obtain the feature embeddings through an MLP projection head. In the Mutual-information-based Mini-Max Infinite Skeleton-data Game Module, the mean mutual information among feature embeddings is computed to construct the utility functions of the ISG. By updating the encoder parameters to maximize the ISG utilities, strong adversarial feature learning is achieved. Finally, in the Dual-Loss-based Equilibrium Optimizer, the optimization is performed using both the $\mathcal{L}_{\mathrm{Push}}$ loss and the KL-divergence-based $\mathcal{MI}$ objective, forming the final loss $\mathcal{L}$. This process further optimizes the model parameters and ensures the convergence of the ISG.
  • Figure 2: Block diagram of MRAM. Rotation Matrices (RMs) around the $x$, $y$, $z$ axes $\mathbf{R}_x(\theta)$, $\mathbf{R}_y(\theta)$, $\mathbf{R}_z(\theta)$ are combined through matrix multiplication to obtain a multi-axis RM $\mathbf{R}_{xyz}(\theta)$. After the input skeleton sequence is processed by the Multi-view Rotation-based Augmentation Module, three transformed views are generated: (a) the average data $\bar{\mathbf{X}}^{(i)}$ obtained by batch averaging, (b) the normally augmented data $\hat{\mathbf{X}}^{(i)}$ produced by multiplying the input with the normal-augmented RM $\mathbf{R}_{xyz}(\theta_{normal})$, and (c) the extremely augmented data $\tilde{\mathbf{X}}^{(i)}$ derived from the extreme-augmented RM $\mathbf{R}_{xyz}(\theta_{extreme})$. These augmented sequences are then fed into the subsequent Mutual-information-based Mini-Max Infinite Skeleton-data Game Module to facilitate robust adversarial representation learning.
  • Figure 3: Visualizations of mean-motion sequences (the other, isolated sequence represents another skeleton instance).
  • Figure 4: Block diagram of M3ISGM. The average data $\bar{\mathbf{X}}^{(i)}$, the normally augmented data $\hat{\mathbf{X}}^{(i)}$, and the extremely augmented data $\tilde{\mathbf{X}}^{(i)}$ produced by the Multi-view Rotation-based Augmentation Module are fed into the Mutual-information-based Mini-Max Infinite Skeleton-data Game Module (M3ISGM). After passing through the key encoder, query encoder 1 (encoder 1) and query encoder 2 (encoder 2), their respective feature representations $\bar{\mathbf{z}}^{(i)}$, $\hat{\mathbf{z}}^{(i)}$, and $\tilde{\mathbf{z}}^{(i)}$ are obtained. The mutual information between $\hat{\mathbf{z}}^{(i)}$ and $\bar{\mathbf{z}}^{(i)}$ is computed as $I_1$, and the mutual information between $\tilde{\mathbf{z}}^{(i)}$ and $\bar{\mathbf{z}}^{(i)}$ is computed as $I_2$. Adding the squared difference between $I_1$ and $I_2$ to $I_1$ yields encoder 2's utility function $u_2$, while encoder 1's utility function is the negation of encoder 2's, i.e., $u_1 = -u_2$. Based on the obtained equilibrium solution, the parameters of query encoder 1 and query encoder 2 (i.e., $f_{q\_1}$ and $f_{q\_2}$) are updated accordingly, which further provides support for the equilibrium optimization based on dual-loss learning.
  • Figure 5: Block diagram of DLEO. After obtaining encoded features from the Mutual-information-based Mini-Max Infinite Skeleton-data Game Module, we feed them into the Dual-Loss-based Equilibrium Optimizer (DLEO). In DLEO, we first compute the distributions among the negative feature $\mathbf{m}_k$, the average feature $\bar{\mathbf{z}}^{(i)}$, the normally augmented feature $\hat{\mathbf{z}}^{(i)}$, and the extremely augmented feature $\tilde{\mathbf{z}}^{(i)}$, formulated as $p(\hat{\mathbf{z}}^{(i)}|\bar{\mathbf{z}}^{(i)})$, $p(\tilde{\mathbf{z}}^{(i)}|\bar{\mathbf{z}}^{(i)})$, $p(\hat{\mathbf{z}}^{(i)}|\mathbf{m}_k)$, and $p(\tilde{\mathbf{z}}^{(i)}|\mathbf{m}_k)$. Based on these distributions, the InfoNCE loss $\mathcal{L}_{\mathrm{Push}}$ and the KL divergence $\mathcal{MI}$ are calculated. Finally, the overall loss function $\mathcal{L}$ is computed from $\mathcal{L}_{\mathrm{Push}}$ and $\mathcal{MI}$, enabling multi-view Mini-Max game-driven contrastive learning to achieve more robust representation learning for skeleton-based action recognition.
  • ...and 5 more figures
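
The augmentation and utility construction described in the captions of Figures 2 and 4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rotation_xyz`, `make_views`, `utilities`), the multiplication order $\mathbf{R}_x \mathbf{R}_y \mathbf{R}_z$, the choice of averaging axis, and the example angles are all hypothetical, and the mutual-information values $I_1$ and $I_2$ are taken as given numbers rather than estimated from encoder outputs.

```python
import numpy as np


def rotation_xyz(theta: float) -> np.ndarray:
    """Multi-axis rotation matrix built from R_x, R_y, R_z with one angle.

    Assumption: the product order R_x @ R_y @ R_z; the source only says the
    three matrices are combined through matrix multiplication.
    """
    c, s = np.cos(theta), np.sin(theta)
    r_x = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    r_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    r_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return r_x @ r_y @ r_z


def make_views(x: np.ndarray, theta_normal: float, theta_extreme: float,
               avg_axis: int = 0):
    """Return (average, normal, extreme) views of a skeleton array.

    x has shape (frames, joints, 3). Assumption: the neutral anchor is the
    mean over `avg_axis` (axis 0 = frames here); the abstract mentions
    temporal averaging while the captions mention batch averaging.
    """
    x_avg = x.mean(axis=avg_axis, keepdims=True)
    # Each joint is a row vector v, so v @ R.T equals (R @ v) transposed.
    x_normal = x @ rotation_xyz(theta_normal).T
    x_extreme = x @ rotation_xyz(theta_extreme).T
    return x_avg, x_normal, x_extreme


def utilities(i1: float, i2: float):
    """Zero-sum utilities as read from the Figure 4 caption.

    u2 = I1 + (I1 - I2)^2 for encoder 2, and u1 = -u2 for encoder 1.
    """
    u2 = i1 + (i1 - i2) ** 2
    return -u2, u2


# Usage with random data standing in for a real skeleton sequence.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(4, 25, 3))  # 4 frames, 25 joints, xyz coordinates
avg_view, normal_view, extreme_view = make_views(sequence, 0.1, 1.2)
u1, u2 = utilities(0.5, 0.3)  # placeholder mutual-information values
```

Because rotations are orthogonal, both augmented views preserve every joint's distance from the origin, so the two views differ only in viewpoint; the gap between `theta_normal` and `theta_extreme` is what makes the perturbation strength explicit in this sketch.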

Theorems & Definitions (8)

  • Definition 1: Mini-Max Game, M2G rapoport2012game
  • Definition 2: (Nash) equilibrium rapoport2012game
  • Definition 3: Infinite Skeleton-data Game, ISG
  • Theorem 1: Equilibrium Theorem for ISG
  • Theorem 2: Equilibrium Theorem for M2G rapoport2012game
  • Proof: Proof of Theorem 1
  • Lemma 1: Kakutani Fixed Point Theorem kakutani1941
  • Proof: Proof of the Equivalence between DLEO and M3ISGM