HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition

Yue Li, Haoxuan Qu, Mengyuan Liu, Jun Liu, Yujun Cai

TL;DR

HyLiFormer addresses the quadratic computational cost of Transformer-based skeleton human action recognition (HAR) by introducing a hyperbolic linear attention framework. It maps Euclidean skeleton data into the Poincaré model via the Hyperbolic Transformation with Curvatures (HTC) module and computes attention with Hyperbolic Linear Attention (HLA), achieving $\mathcal{O}(N)$ complexity in both theory and practice while modeling hierarchical joint structures. Empirical results on NTU RGB+D and NTU RGB+D 120 show competitive accuracy (e.g., ~87.5% on X-Sub120) with substantially reduced training time. Ablations identify $\kappa=-1$ as the optimal curvature and highlight the limitations of directly applying Euclidean linear attention to skeleton HAR. The work provides a scalable, geometry-aware Transformer for real-world HAR applications, demonstrating the benefits of combining hyperbolic geometry with linear attention for hierarchical sequence data.

Abstract

Transformers have demonstrated remarkable performance in skeleton-based human action recognition, yet their quadratic computational complexity remains a bottleneck for real-world applications. To mitigate this, linear attention mechanisms have been explored but struggle to capture the hierarchical structure of skeleton data. Meanwhile, the Poincaré model, as a typical hyperbolic geometry, offers a powerful framework for modeling hierarchical structures but lacks well-defined operations for existing mainstream linear attention. In this paper, we propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition. Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling. Theoretical analysis and extensive experiments on NTU RGB+D and NTU RGB+D 120 datasets demonstrate that HyLiFormer significantly reduces computational complexity while preserving model accuracy, making it a promising solution for efficiency-critical applications.

Paper Structure

This paper contains 17 sections, 2 theorems, 13 equations, 2 figures, 4 tables.

Key Result

Lemma 1

Given an input skeleton data point $\mathbf{x} \in \mathbb{R}^{T \times V \times M \times C_{\text{in}}}$, the transformation applied by the HTC module ensures that the output $\mathbf{x^{\mathbb{B}}_{\kappa}}$ satisfies the Poincaré model constraint, i.e., $\|\mathbf{x^{\mathbb{B}}_{\kappa}}\| < -\
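
To illustrate the kind of constraint Lemma 1 is concerned with, the sketch below maps a Euclidean vector into the Poincaré ball using the standard exponential map at the origin, and maps it back with the matching logarithmic map. It assumes the common convention that a ball of curvature $\kappa < 0$ has radius $1/\sqrt{-\kappa}$. This is a generic textbook construction rather than the paper's HTC module, and the function names `exp_map_origin` and `log_map_origin` are hypothetical.

```python
import math


def exp_map_origin(v, kappa=-1.0):
    """Map a Euclidean vector into the Poincare ball of curvature kappa < 0.

    Assumed convention: the ball has radius 1 / sqrt(-kappa).
    """
    if kappa >= 0:
        raise ValueError("kappa must be negative for hyperbolic space")
    sqrt_c = math.sqrt(-kappa)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    # tanh squashes the norm into [0, 1), so the result stays inside the ball.
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def log_map_origin(y, kappa=-1.0):
    """Inverse of exp_map_origin: map a point of the ball back to Euclidean space."""
    if kappa >= 0:
        raise ValueError("kappa must be negative for hyperbolic space")
    sqrt_c = math.sqrt(-kappa)
    norm = math.sqrt(sum(x * x for x in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in y]
```

Because `tanh` saturates to exactly 1 in floating point for large inputs, practical implementations usually clamp the mapped norm slightly inside the boundary; the sketch omits that step for brevity.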

Figures (2)

  • Figure 1: (a) The Process of Softmax Attention. The final attention output is computed by first multiplying $Q$ and $K^T$, and then multiplying the result with $V$. Each row of the $Q$, $K$, and $V$ matrices (denoted by the slash in the figure) represents a temporal or spatial feature. Because the product $QK^T$ has $N \times N$ entries, the computational complexity of Softmax Attention is $\mathcal{O}(N^2)$. (b) The Curve of Computational Complexity Growth with Feature Sequence Length. As the sequence length increases, Softmax Attention exhibits quadratic growth ($\mathcal{O}(N^2)$), whereas Linear Attention achieves significantly lower computational overhead with linear growth ($\mathcal{O}(N)$).
  • Figure 2: Framework of HyLiFormer. The input data (skeleton data) is projected onto the Poincaré model through Hyperbolic Transformation with Curvatures (HTC). The transformed data then passes through the hyperbolic linear attention block, which captures the temporal and hierarchical information of the skeleton data. Finally, the data is mapped back to Euclidean space using the inverse of the HTC, which is omitted in the diagram for simplicity.
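
The complexity gap described in Figure 1 comes from the order of multiplication. The sketch below uses ordinary Euclidean kernelized linear attention, not the paper's HLA, with the feature map $\phi(x) = \mathrm{elu}(x) + 1$ as an assumed choice. It shows that computing $(\phi(Q)\phi(K)^T)V$ and $\phi(Q)(\phi(K)^T V)$ gives the same output, while only the first builds an $N \times N$ matrix. All function names are hypothetical.

```python
import math


def feature_map(x):
    # elu(x) + 1: a strictly positive feature map (an assumed choice).
    return x + 1.0 if x > 0 else math.exp(x)


def matmul(a, b):
    # Plain matrix product of an (n x m) and an (m x p) nested list.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def apply_map(m):
    return [[feature_map(x) for x in row] for row in m]


def quadratic_order(q, k, v):
    """(phi(Q) phi(K)^T) V with row normalization: O(N^2) in sequence length."""
    fq, fk = apply_map(q), apply_map(k)
    scores = matmul(fq, transpose(fk))  # N x N
    out = matmul(scores, v)             # N x d_v
    return [[o / sum(s_row) for o in o_row]
            for o_row, s_row in zip(out, scores)]


def linear_order(q, k, v):
    """phi(Q) (phi(K)^T V) with the same normalization: O(N) in sequence length."""
    fq, fk = apply_map(q), apply_map(k)
    kv = matmul(transpose(fk), v)           # d x d_v, independent of N
    k_sum = [sum(col) for col in zip(*fk)]  # length d
    out = matmul(fq, kv)                    # N x d_v
    return [[o / sum(a * b for a, b in zip(fq_row, k_sum)) for o in o_row]
            for o_row, fq_row in zip(out, fq)]
```

The two functions agree up to floating-point rounding because matrix multiplication is associative once the softmax is replaced by a separable feature map; the saving comes from never materializing the $N \times N$ score matrix.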

Theorems & Definitions (2)

  • Lemma 1
  • Lemma 2