Skeleton-based Action Recognition with Non-linear Dependency Modeling and Hilbert-Schmidt Independence Criterion
Yuheng Yang
TL;DR
This work tackles skeleton-based action recognition by addressing two key challenges: non-linear dependencies between distant joints and high-dimensional motion representations. It introduces a dependency refinement that uses a Gaussian correlation with kernel width $\delta$ to augment the skeletal graph. This is coupled with a Hilbert-Schmidt Independence Criterion (HSIC) framework that maps the augmented features into a Hilbert space via a Matérn kernel $k_{\eta}$ and optimizes a combined objective $\mathcal{L}_{Total}=\mathcal{L}_{cls}+\text{HSIC}(\hat{z},y)+\mathcal{L}_{CE}+\mathcal{L}_{D}$, where $\text{HSIC}(\hat{z},y)=\frac{1}{(n-1)^2}\operatorname{tr}(K_{\hat{z}} H K_y H)$. Here $K_{\hat{z}}$ and $K_y$ are the $n \times n$ kernel (Gram) matrices of the features and labels over $n$ samples, and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix. A multi-stream ensemble trains models with multiple kernel widths and input types (joint and bone) and averages their predictions at inference. Empirically, the approach achieves state-of-the-art results on NTU RGB+D 60/120 and Northwestern-UCLA, with ablations confirming the contribution of HSIC, the distillation loss, the refined graph, and the ensemble. The work offers a robust, dimension-agnostic method for capturing complex, long-range skeletal dependencies and discriminative action representations, with potential for extension to few-shot, unsupervised, and cross-task contexts.
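To make the HSIC term concrete, the following is a minimal NumPy sketch of the empirical estimator $\frac{1}{(n-1)^2}\operatorname{tr}(K_{\hat{z}} H K_y H)$ with the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$. Several choices here are assumptions rather than details taken from the paper: the Matérn smoothness is fixed at $\nu = 3/2$, the `length_scale` parameter stands in for the kernel parameter $\eta$, the label kernel is a simple same-class indicator, and the function names and toy data are illustrative only.

```python
import numpy as np

def matern32_gram(x, length_scale=1.0):
    """Gram matrix of a Matern kernel with smoothness 3/2 (assumed choice)."""
    diff = x[:, None, :] - x[None, :, :]
    r = np.sqrt(np.sum(diff ** 2, axis=-1))   # pairwise Euclidean distances
    s = np.sqrt(3.0) * r / length_scale
    return (1.0 + s) * np.exp(-s)

def label_gram(y):
    """Label kernel (assumed): 1 if two samples share a class, else 0."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

def hsic(K_z, K_y):
    """Empirical HSIC: tr(K_z H K_y H) / (n - 1)^2."""
    n = K_z.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)  # centering matrix
    return float(np.trace(K_z @ H @ K_y @ H)) / (n - 1) ** 2

# Toy data: two classes whose features differ in mean.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
z = rng.normal(size=(100, 4)) + 3.0 * y[:, None]

dependent = hsic(matern32_gram(z), label_gram(y))
shuffled = hsic(matern32_gram(z), label_gram(rng.permutation(y)))
print(dependent, shuffled)
```

In this toy setup the value computed with the true labels should come out clearly larger than the value computed with shuffled labels, which is the property that makes HSIC usable as a dependence measure between representations and classes.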
Abstract
Human skeleton-based action recognition has long been an indispensable aspect of artificial intelligence. Current state-of-the-art methods tend to consider only the dependencies between connected skeletal joints, which limits their ability to capture non-linear dependencies between physically distant joints. Moreover, most existing approaches distinguish action classes by estimating the probability density of motion representations, yet the high dimensionality of human motion makes such estimation inherently difficult. In this paper, we tackle these challenges from two directions: (1) We propose a novel dependency refinement approach that explicitly models dependencies between any pair of joints, regardless of the distance between them. (2) We further propose a framework that uses the Hilbert-Schmidt Independence Criterion to differentiate action classes without being affected by data dimensionality, and mathematically derive learning objectives that guarantee precise recognition. Empirically, our approach achieves state-of-the-art performance on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets.
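As a rough illustration of direction (1), the sketch below augments a skeletal adjacency matrix with a Gaussian affinity between every pair of joints, so that joints with no connecting bone still receive a non-zero weight. The TL;DR describes the refinement as a Gaussian correlation with kernel width $\delta$; the exact formulation is not given here, so the per-joint feature vectors, the additive combination with the adjacency matrix, and the name `refine_adjacency` are all assumptions made for illustration.

```python
import numpy as np

def refine_adjacency(adjacency, joint_features, delta=1.0):
    """Add a Gaussian pairwise affinity to a skeletal adjacency matrix.

    Assumed form: affinity[i, j] = exp(-||x_i - x_j||^2 / (2 * delta^2)).
    """
    a = np.asarray(adjacency, dtype=float)
    x = np.asarray(joint_features, dtype=float)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    affinity = np.exp(-d2 / (2.0 * delta ** 2))
    np.fill_diagonal(affinity, 0.0)  # leave self-connections unchanged
    return a + affinity

# Toy skeleton: a 4-joint chain 0-1-2-3, so joints 0 and 3 share no bone.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
features = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [2.0, 0.0],
                     [0.1, 0.0]])  # joint 3 moves similarly to joint 0

refined = refine_adjacency(chain, features, delta=1.0)
print(refined[0, 3], chain[0, 3])
```

A smaller `delta` concentrates the added weight on joints with very similar features, while a larger one spreads it more evenly, which is one reason an ensemble over several kernel widths is a natural design.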
