Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition
Abhisek Ray, Ayush Raj, Maheshkumar H. Kolekar
TL;DR
This work addresses skeleton-based action recognition by integrating higher-order hypergraph modeling with transformer-based sequence processing. It introduces AutoregAd-HGformer, which combines an autoregressive in-phase hypergraph encoder with a model-agnostic out-phase hypergraph decoder, enabling robust, action-dependent hyperedge formation and long-range dependency capture. The model uses joint, hyperedge, and bone cross-attention within a hierarchical FAHT structure, coupled with a hybrid supervised/self-supervised loss including a vector-quantization objective. Empirical results on NTU-60, NTU-120, and NW-UCLA demonstrate state-of-the-art performance, with ablations validating the contribution of each component and revealing favorable accuracy-complexity trade-offs. The approach advances skeleton-based recognition by effectively fusing multiscale semantics, higher-order relations, and adaptive hyperedge learning to handle diverse actions across viewpoints.
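As a rough illustration of how vector quantization can yield discrete hyperedges, the sketch below assigns each joint embedding to its nearest codebook vector and groups joints that share a code into one hyperedge. This is an assumption made for illustration, not the authors' implementation: the function name `quantize_to_hyperedges`, the Euclidean nearest-neighbour rule, and the toy data are all hypothetical, and the autoregressive prior is omitted entirely.

```python
from collections import defaultdict
from math import dist


def quantize_to_hyperedges(joint_embeddings, codebook):
    """Group joints by the index of their nearest codebook vector.

    Hypothetical illustration only: joints that map to the same code
    are treated as members of the same hyperedge.
    """
    hyperedges = defaultdict(list)
    for joint_idx, emb in enumerate(joint_embeddings):
        # Nearest-neighbour lookup in the codebook (Euclidean distance).
        code_idx = min(range(len(codebook)), key=lambda k: dist(emb, codebook[k]))
        hyperedges[code_idx].append(joint_idx)
    return dict(hyperedges)


# Toy data (assumed): 5 joints with 2-D embeddings and a 2-entry codebook.
toy_joints = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.1, 0.9), (0.0, 0.2)]
toy_codebook = [(0.0, 0.0), (1.0, 1.0)]
toy_hyperedges = quantize_to_hyperedges(toy_joints, toy_codebook)
```

In this toy case, joints 0, 1, and 4 fall into the hyperedge for code 0 and joints 2 and 3 into the hyperedge for code 1. A learned codebook would replace the fixed one used here.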
Abstract
Graph Convolutional Networks (GCNs) alone cannot adequately extract multiscale contextual information and higher-order correlations among skeleton sequences for effective action classification. Hypergraph convolution addresses these issues but cannot capture long-range dependencies. Transformers capture these dependencies effectively and make complex contextual features accessible. We propose an Autoregressive Adaptive HyperGraph Transformer (AutoregAd-HGformer) model for in-phase (autoregressive and discrete) and out-phase (adaptive) hypergraph generation. The vector-quantized in-phase hypergraph, equipped with powerful autoregressive learned priors, produces a more robust and informative representation suitable for hyperedge formation. The out-phase hypergraph generator provides a model-agnostic hyperedge learning technique that aligns the attributes with the input skeleton embedding. The hybrid (supervised and unsupervised) learning in AutoregAd-HGformer explores action-dependent features along the spatial, temporal, and channel dimensions. Extensive experiments and an ablation study indicate the superiority of our model over state-of-the-art hypergraph architectures on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
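To make the notion of higher-order correlation concrete, the following minimal sketch shows a generic, simplified hypergraph aggregation step: joint features are averaged into hyperedge features, and each joint then averages the features of the hyperedges that contain it. With incidence matrix H, this corresponds to D_v^{-1} H D_e^{-1} H^T X with unit hyperedge weights and no learned projection. It is not the AutoregAd-HGformer layer; the function name and toy inputs are assumptions, and every hyperedge is assumed to be non-empty.

```python
def hypergraph_mean_aggregate(features, hyperedges):
    """Two-step mean aggregation: joints -> hyperedges -> joints.

    features:   list of per-joint feature vectors (lists of floats).
    hyperedges: list of non-empty lists of joint indices (assumed input format).
    """
    dim = len(features[0])

    # Step 1: each hyperedge feature is the mean of its member joints.
    edge_feats = [
        [sum(features[j][d] for j in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]

    # Step 2: each joint averages the features of its incident hyperedges.
    out = []
    for j in range(len(features)):
        incident = [e for e, members in enumerate(hyperedges) if j in members]
        if not incident:
            # A joint in no hyperedge keeps its own feature unchanged.
            out.append(list(features[j]))
        else:
            out.append(
                [sum(edge_feats[e][d] for e in incident) / len(incident) for d in range(dim)]
            )
    return out


# Toy data (assumed): 3 joints with 1-D features, 2 overlapping hyperedges.
toy_feats = [[1.0], [3.0], [5.0]]
toy_edges = [[0, 1], [1, 2]]
toy_out = hypergraph_mean_aggregate(toy_feats, toy_edges)
```

Unlike an ordinary graph edge, a hyperedge can contain any number of joints, so a single aggregation step can mix information across a whole body part or a set of co-moving joints. Joint 1 above belongs to both hyperedges and therefore receives information from all three joints.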
