Skeleton-based Action Recognition with Non-linear Dependency Modeling and Hilbert-Schmidt Independence Criterion

Yuheng Yang

TL;DR

This work tackles skeleton-based action recognition by addressing two key challenges: non-linear dependencies between distant joints and high-dimensional motion representations. It introduces a dependency refinement step that uses a Gaussian correlation function with kernel width $\delta$ to augment the skeletal graph, coupled with a Hilbert-Schmidt Independence Criterion (HSIC) framework that maps augmented features into a Hilbert space via a Matérn kernel $k_{\eta}$ and optimizes a combined objective $\mathcal{L}_{Total}=\mathcal{L}_{cls}+\text{HSIC}(\hat{z},y)+\mathcal{L}_{CE}+\mathcal{L}_{D}$, where $\text{HSIC}(\hat{z},y)=\frac{1}{(n-1)^2}\operatorname{tr}(K_{\hat{z}} H K_y H)$. Here $K_{\hat{z}}$ and $K_y$ are the kernel Gram matrices of the augmented features and the labels over $n$ samples, and $H$ is the $n \times n$ centering matrix. A multi-stream ensemble trains models with multiple kernel widths and input types (joint and bone) and averages predictions at inference. Empirically, the approach achieves state-of-the-art results on NTU RGB+D 60/120 and Northwestern-UCLA, with ablations confirming the contribution of HSIC, the distillation loss, the refined graph, and the ensemble. The work offers a robust, dimension-agnostic method for capturing complex, long-range skeletal dependencies and discriminative action representations, with potential for extension to few-shot, unsupervised, and cross-task contexts.
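As a rough illustration of the HSIC term in the objective, the NumPy sketch below computes the empirical estimator $\frac{1}{(n-1)^2}\operatorname{tr}(K_{\hat{z}} H K_y H)$ on toy data. Several details are assumptions made only for this example and are not taken from the paper: the Matérn smoothness is fixed at 3/2, the label kernel is a linear kernel on one-hot labels, and the helper names (`matern32_gram`, `label_gram`, `hsic`) and the synthetic features are hypothetical.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x (shape: n x d)."""
    sq_norms = np.sum(x ** 2, axis=1, keepdims=True)
    sq_dist = sq_norms + sq_norms.T - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(sq_dist, 0.0))  # clip tiny negatives from round-off


def matern32_gram(x, length_scale=1.0):
    """Matern Gram matrix with smoothness 3/2 (the smoothness value is an assumption)."""
    scaled = np.sqrt(3.0) * pairwise_distances(x) / length_scale
    return (1.0 + scaled) * np.exp(-scaled)


def label_gram(labels, num_classes):
    """Linear kernel on one-hot labels: 1 if two samples share a class, else 0 (assumption)."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot @ one_hot.T


def hsic(k_z, k_y):
    """Empirical HSIC: tr(K_z H K_y H) / (n - 1)^2 with centering matrix H."""
    n = k_z.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(k_z @ h @ k_y @ h) / (n - 1) ** 2


# Toy data: 60 samples, 3 classes (synthetic, for illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=60)
informative = 3.0 * np.eye(3)[labels] + 0.1 * rng.standard_normal((60, 3))
uninformative = rng.standard_normal((60, 3))

k_y = label_gram(labels, num_classes=3)
hsic_informative = hsic(matern32_gram(informative), k_y)
hsic_uninformative = hsic(matern32_gram(uninformative), k_y)
```

In this toy setting, features that cluster by class yield a larger HSIC value than features drawn independently of the labels, which is the property that makes the criterion usable as a dependence measure between representations and classes.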

Abstract

Human skeleton-based action recognition has long been an indispensable aspect of artificial intelligence. Current state-of-the-art methods tend to consider only the dependencies between connected skeletal joints, limiting their ability to capture non-linear dependencies between physically distant joints. Moreover, most existing approaches distinguish action classes by estimating the probability density of motion representations, yet the high-dimensional nature of human motion makes such estimation inherently difficult. In this paper, we seek to tackle these challenges from two directions: (1) We propose a novel dependency refinement approach that explicitly models dependencies between any pair of joints, effectively transcending the limitations imposed by joint distance. (2) We further propose a framework that utilizes the Hilbert-Schmidt Independence Criterion to differentiate action classes without being affected by data dimensionality, and mathematically derive learning objectives guaranteeing precise recognition. Empirically, our approach achieves state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.

Paper Structure

This paper contains 13 sections, 11 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The conceptual diagram illustrates the mapping process of feature representations from Euclidean space into Hilbert space. The mapping process is achieved through the kernel function.
  • Figure 2: The diagram illustrates the dependency refinement method. Specifically, we utilize the Gaussian correlation function to quantify dependencies between joints and incorporate them into the initial graph. By adjusting the kernel width in the Gaussian function, we can effectively capture the dependencies at both adjacent (orange) and distant (blue) scales.
  • Figure 3: The overall pipeline of our HSIC-based framework aims at recognizing the action classes of the motion sequences. For clarity, we only illustrate the pipeline using a single sequence $\mathcal{S}$ in this figure. The pipeline starts with refining joint dependencies, followed by extracting the motion features $z$ and $\tilde{z}$ from the base model and the auxiliary model, respectively. Subsequently, we feed $\tilde{z}$ to a classifier to obtain the auxiliary information $\tilde{y}$. In order to enhance the discriminative power of $z$, we incorporate $\tilde{y}$ into $z$ to obtain the augmented feature $\hat{z}$. We then engage a kernel function $k_\eta(\cdot)$ to transform $\hat{z}$ into Hilbert space and derive learning objectives, which effectively avoid the issue arising from the data dimensionality. The entire learning objective $\mathcal{L}_{Total}$ consists of $\mathcal{L}_{cls}$, HSIC, $\mathcal{L}_{CE}$, and $\mathcal{L}_{D}$.
  • Figure 4: The visualization illustrates feature representations of five randomly selected action classes. We utilize t-SNE for dimension reduction. Each action class is represented by a different color. The five action classes are Eat meal, Brush teeth, Brush hair, Hand clap, and Read book.
  • Figure 5: The histograms show the quantitative results of applying the three methods to thirty action classes. The quantitative results of each method are depicted with different colors.
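
The dependency refinement described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the captions do not give the exact form of the Gaussian correlation function or how it is merged with the initial graph, so the sketch assumes a standard Gaussian of squared Euclidean distance between per-joint feature vectors and a simple additive combination with the adjacency matrix. The function names, the 4-joint chain, and the 1-D joint positions are all hypothetical.

```python
import numpy as np


def gaussian_dependency(joint_feats, delta):
    """Pairwise Gaussian weights exp(-||x_i - x_j||^2 / (2 * delta^2)) (assumed form)."""
    diff = joint_feats[:, None, :] - joint_feats[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * delta ** 2))


def refine_graph(adjacency, joint_feats, delta):
    """Add Gaussian dependencies to the initial graph (additive merge is an assumption).

    Note: the Gaussian term has ones on its diagonal, so this also adds self-loops.
    """
    return adjacency + gaussian_dependency(joint_feats, delta)


# Hypothetical 4-joint chain 0-1-2-3 with joints placed one unit apart on a line.
chain_adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
positions = np.array([[0.0], [1.0], [2.0], [3.0]])

refined_narrow = refine_graph(chain_adjacency, positions, delta=0.5)  # mostly local
refined_wide = refine_graph(chain_adjacency, positions, delta=3.0)    # reaches far joints
```

With the narrow width, the weight between the two end joints stays close to zero, while the wide width assigns them a substantial weight. This mirrors the caption's point that varying the kernel width exposes dependencies at adjacent and distant scales, and it suggests why models trained with several widths could be complementary in an ensemble.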