Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation
Abu Taib Mohammed Shahjahan, A. Ben Hamza
TL;DR
PoseKAN is an adaptive graph Kolmogorov-Arnold Network for 3D human pose estimation that replaces fixed node-wise activation functions with learnable edge-wise functions and uses multi-hop propagation to capture long-range skeletal dependencies. For 2D-to-3D lifting from a single image, it combines a spectral modulation filter with the propagation scheme $\mathbf{P}=(1-s)\hat{\mathbf{A}}+s\hat{\mathbf{A}}^{2}$, which mixes one-hop and two-hop neighborhoods, to mitigate spectral bias and increase expressiveness; the architecture is built from residual blocks with global response normalization. The model is trained with an elastic-net-inspired objective that combines $L_2$ and $L_1$ terms, $L = \frac{1}{N} \Bigl[(1-\alpha) \sum_{i=1}^{N} \| \mathbf{y}_i - \hat{\mathbf{y}}_i \|_2^2 + \alpha \sum_{i=1}^{N} \| \mathbf{y}_i - \hat{\mathbf{y}}_i \|_1 \Bigr]$, where $\alpha$ weights the $L_1$ term. It achieves performance competitive with state-of-the-art methods on Human3.6M and generalizes well to MPI-INF-3DHP, while maintaining a compact parameter budget (~$5.72$M). These results indicate improved robustness to occlusions and depth ambiguities, with potential for extension to multi-person pose estimation and broader graph-based tasks.
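The propagation scheme and the training objective stated in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the toy 4-joint chain skeleton is invented for illustration, and the symmetric normalization with self-loops used for $\hat{\mathbf{A}}$ is an assumption borrowed from standard GCN practice, since the summary does not say how $\hat{\mathbf{A}}$ is built.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalize A + I (assumed GCN-style definition of A_hat)."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagation_matrix(A_hat, s):
    """P = (1 - s) * A_hat + s * A_hat^2: blend of one-hop and two-hop terms."""
    return (1.0 - s) * A_hat + s * (A_hat @ A_hat)

def combined_loss(y, y_hat, alpha):
    """(1/N) * [(1 - alpha) * sum ||y_i - y_hat_i||_2^2 + alpha * sum ||y_i - y_hat_i||_1]."""
    diff = y - y_hat
    l2_term = np.sum(diff ** 2)                 # sum of squared L2 norms over all i
    l1_term = np.sum(np.abs(diff))              # sum of L1 norms over all i
    return ((1.0 - alpha) * l2_term + alpha * l1_term) / y.shape[0]

# Toy skeleton (hypothetical): a chain of 4 joints, 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = normalized_adjacency(A)
P = propagation_matrix(A_hat, s=0.5)

# Toy targets and predictions: N = 2 items, 3 coordinates each.
y = np.zeros((2, 3))
y_hat = 2.0 * np.ones((2, 3))
loss = combined_loss(y, y_hat, alpha=0.25)
```

In this toy example, joints 0 and 2 are not directly connected, so the corresponding entry of `A_hat` is zero while the entry of `P` is positive for any `s > 0`: the two-hop term is what lets information travel between them in a single propagation step. Setting `s = 0` recovers plain one-hop propagation.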
Abstract
Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture the long-range dependencies that are essential for handling occlusions and depth ambiguities. They also exhibit spectral bias: they prioritize low-frequency components and struggle to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold Network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs, which use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This makes the model more adaptable and more expressive in learning complex pose variations. Our model employs multi-hop feature aggregation so that body joints can draw on information from both local and distant neighbors, improving spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement and global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate that our model performs competitively with state-of-the-art methods.
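To make the global response normalization step more concrete, the following is a minimal NumPy sketch applied to a joint-by-channel feature matrix. The abstract does not give the exact formulation used in PoseKAN, so this sketch assumes the commonly used form of global response normalization (per-channel L2 norm aggregated over all joints, divided by the mean norm across channels, followed by a learnable scale, shift, and residual connection). The function name and the parameters `gamma`, `beta`, and `eps` are hypothetical.

```python
import numpy as np

def global_response_norm(X, gamma, beta, eps=1e-6):
    """Assumed GRN form for features X of shape (num_joints, num_channels)."""
    # Global aggregation: one L2 norm per channel, taken over all joints.
    g = np.linalg.norm(X, axis=0, keepdims=True)          # shape (1, C)
    # Divisive normalization: each channel's norm relative to the mean norm.
    n = g / (g.mean(axis=1, keepdims=True) + eps)         # shape (1, C)
    # Calibrate the features, then add the residual connection.
    return gamma * (X * n) + beta + X

# Toy features (hypothetical): 2 joints, 2 channels; channel 1 has larger responses.
X = np.array([[1.0, 3.0],
              [1.0, 3.0]])
out_identity = global_response_norm(X, gamma=np.zeros(2), beta=np.zeros(2))
out_scaled = global_response_norm(X, gamma=np.ones(2), beta=np.zeros(2))
```

With `gamma` and `beta` set to zero the layer reduces to the identity, which is a common initialization for this kind of normalization. With a nonzero `gamma`, channels whose global response is above average are amplified relative to weaker ones, which matches the abstract's description of improved feature selectivity and contrast.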
