Autoregressive Adaptive Hypergraph Transformer for Skeleton-based Activity Recognition

Abhisek Ray, Ayush Raj, Maheshkumar H. Kolekar

TL;DR

This work addresses skeleton-based action recognition by integrating higher-order hypergraph modeling with transformer-based sequence processing. It introduces AutoregAd-HGformer, which combines an autoregressive in-phase hypergraph encoder with a model-agnostic out-phase hypergraph decoder, enabling robust, action-dependent hyperedge formation and long-range dependency capture. The model uses joint, hyperedge, and bone cross-attention within a hierarchical FAHT structure, coupled with a hybrid supervised/self-supervised loss including a vector-quantization objective. Empirical results on NTU-60, NTU-120, and NW-UCLA demonstrate state-of-the-art performance, with ablations validating the contribution of each component and revealing favorable accuracy-complexity trade-offs. The approach advances skeleton-based recognition by effectively fusing multiscale semantics, higher-order relations, and adaptive hyperedge learning to handle diverse actions across viewpoints.

Abstract

Graph Convolutional Networks (GCNs) alone are inadequate for extracting the multiscale contextual information and higher-order correlations among skeleton sequences that effective action classification requires. Hypergraph convolution addresses these issues but cannot capture long-range dependencies. The transformer is effective at capturing these dependencies and making complex contextual features accessible. We propose an Autoregressive Adaptive HyperGraph Transformer (AutoregAd-HGformer) model for in-phase (autoregressive and discrete) and out-phase (adaptive) hypergraph generation. The vector-quantized in-phase hypergraph, equipped with powerful autoregressive learned priors, produces a more robust and informative representation suitable for hyperedge formation. The out-phase hypergraph generator provides a model-agnostic hyperedge learning technique that aligns the attributes with the input skeleton embedding. The hybrid (supervised and unsupervised) learning in AutoregAd-HGformer explores action-dependent features along the spatial, temporal, and channel dimensions. Extensive experimental results and an ablation study indicate the superiority of our model over state-of-the-art hypergraph architectures on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets.
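
To make the idea of a vector-quantized hypergraph more concrete, the sketch below shows generic nearest-codebook quantization of joint embeddings and then groups joints that share a code into one hyperedge. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`quantize`, `hyperedges_from_assignments`), the toy embeddings, the two-entry codebook, and the rule "one codebook entry per hyperedge" are all hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Assign each embedding to its nearest codebook entry.

    Returns the chosen code index per embedding and the quantized vectors.
    """
    indices: List[int] = []
    quantized: List[List[float]] = []
    for emb in embeddings:
        # Pick the codebook entry with the smallest squared distance.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(emb, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


def hyperedges_from_assignments(indices: Sequence[int], num_codes: int) -> List[List[int]]:
    """Assumed grouping rule: joints sharing a code form one hyperedge."""
    edges: List[List[int]] = [[] for _ in range(num_codes)]
    for joint, code in enumerate(indices):
        edges[code].append(joint)
    return edges


# Toy data (hypothetical): four joints with 2-D embeddings, two codebook entries.
joint_embeddings = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.8]]
codebook = [[0.0, 0.0], [1.0, 1.0]]

codes, quantized_embeddings = quantize(joint_embeddings, codebook)
hyperedges = hyperedges_from_assignments(codes, num_codes=len(codebook))
print(codes)       # code index chosen for each joint
print(hyperedges)  # joints grouped by shared code
```

Unlike an ordinary graph edge, each hyperedge here can connect any number of joints, which is what lets a hypergraph express higher-order relations. In a trained model the codebook would be learned together with the rest of the network rather than fixed by hand as in this toy example.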

Paper Structure

This paper contains 30 sections, 24 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Model abstraction. Model-agnostic iterative hypergraph (left), attention variants (middle), and AutoregAd-HGformer (right).
  • Figure 2: Proposed framework for autoregressive in-phase hypergraph quantizer (left) and adaptive hypergraph decoder (right).
  • Figure 3: t-SNE (van der Maaten and Hinton, 2008) visualization of input features (left) and model output feature embeddings (right).
  • Figure 4: Misclassification between various ambiguous actions (Axx & Axx) of the NTU RGB+D 60 dataset before and after implementing the adaptive decoder. A11: reading, A12: writing, A29: play with the phone/tablet, A30: type on a keyboard.
  • Figure 5: Epoch-wise performance (accuracy in %) comparison of the proposed AutoregAd-HGformer on NTU RGB+D 60 (X-sub). Left: transformer block counts ($L$); Middle: transformer channel counts ($c$); Right: hyperedge count ($k$).
  • ...and 1 more figure