LLMs are Good Action Recognizers

Haoxuan Qu, Yujun Cai, Jun Liu

TL;DR

This work addresses skeleton-based action recognition by using a large language model whose pre-trained weights are left unmodified. It introduces LLM-AR, which projects each skeleton sequence into an "action sentence" using an action-based VQ-VAE with a hyperbolic codebook and human inductive biases, so that the LLM can perform recognition as an instruction-following task. Training occurs in two stages: the first learns the action sentence tokenizer with reconstruction, embedding, commitment, and bias losses; the second applies LoRA to align the LLM to the action sentences while keeping its original weights fixed. The approach achieves state-of-the-art results on NTU RGB+D, NTU RGB+D 120, Toyota Smarthome, and UAV-Human, and ablations confirm the contribution of the proposed biases, discretization, hyperbolic coding, and LoRA tuning in leveraging the knowledge embedded in the LLM. The framework offers a practical way to bring language priors into cross-modal action recognition.

Abstract

Skeleton-based action recognition has attracted considerable research attention. Recently, a variety of works have been proposed to build accurate skeleton-based action recognizers. Among them, some use large model architectures as the backbones of their recognizers to boost the skeleton data representation capability, while others pre-train their recognizers on external data to enrich their knowledge. In this work, we observe that large language models, which have been extensively used in various natural language processing tasks, generally hold both large model architectures and rich implicit knowledge. Motivated by this, we propose a novel LLM-AR framework, in which we investigate treating the Large Language Model as an Action Recognizer. In our framework, we propose a linguistic projection process to project each input action signal (i.e., each skeleton sequence) into its "sentence format" (i.e., an "action sentence"). Moreover, we also incorporate several designs into our framework to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework.

Paper Structure

This paper contains 11 sections, 13 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of our proposed LLM-AR framework. In our framework, given an input action signal, we first perform a linguistic projection process to acquire the corresponding "action sentence". We then perform action recognition via the large language model with its pre-trained weights untouched to keep its pre-learned rich knowledge.
  • Figure 2: Overview of the action-based VQ-VAE model with the hyperbolic codebook $C_H$ incorporated. Given a batch of input action signals $\{s^b_{1:V}\}_{b=1}^B$, to optimize the action-based VQ-VAE model, $\{s^b_{1:V}\}_{b=1}^B$ are first fed to the encoder $E$ to get the corresponding latent features $\{f^b_{1:W}\}_{b=1}^B$. Next, to perform quantization with the hyperbolic codebook $C_H$, which can serve as a good representation of the tree-like human skeletons, $\{f^b_{1:W}\}_{b=1}^B$ are projected into the hyperbolic space via the process of E-to-H projection. After that, the quantization is performed in the hyperbolic space using $dist_{\mathbb{B}}(\widetilde{f}_w, c_u)$ defined in Eq. \ref{eq:hyperbolic_distance} as the distance function. Finally, after quantization, the discrete versions of the latent features are passed back into the Euclidean space via the process of H-to-E projection to reconstruct the input action signals through the decoder $D$.
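
The quantization path described in the Figure 2 caption (E-to-H projection, nearest-code lookup under a hyperbolic distance, H-to-E projection) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper's exact projection and distance formulas are not reproduced in this summary, so the sketch assumes a Poincaré ball of curvature -1, the standard exponential and logarithmic maps at the origin, and the standard Poincaré distance. All function names (`exp_map_origin`, `log_map_origin`, `poincare_distance`, `quantize`) and the toy values are hypothetical.

```python
import math

EPS = 1e-6  # keeps projected points strictly inside the unit ball


def exp_map_origin(v):
    """Assumed E-to-H projection: exponential map at the origin of the Poincare ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    radius = min(math.tanh(norm), 1.0 - EPS)
    return [radius * x / norm for x in v]


def log_map_origin(p):
    """Assumed H-to-E projection: logarithmic map at the origin (inverse of exp_map_origin)."""
    norm = math.sqrt(sum(x * x for x in p))
    if norm == 0.0:
        return list(p)
    radius = math.atanh(min(norm, 1.0 - EPS))
    return [radius * x / norm for x in p]


def poincare_distance(x, y):
    """Standard Poincare-ball distance; stands in for dist_B(f_w, c_u) from the caption."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - sum(a * a for a in x)) * (1.0 - sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * sq_diff / denom)


def quantize(latents, codebook):
    """Map each Euclidean latent feature to its nearest hyperbolic code.

    Returns the code indices (the discrete tokens) and the selected codes
    projected back to Euclidean space for a decoder to consume.
    """
    indices, quantized = [], []
    for f in latents:
        f_h = exp_map_origin(f)  # E-to-H projection
        u = min(range(len(codebook)),
                key=lambda k: poincare_distance(f_h, codebook[k]))
        indices.append(u)
        quantized.append(log_map_origin(codebook[u]))  # H-to-E projection
    return indices, quantized


# Toy 2-D example: three codes inside the unit ball, three latent features.
codebook = [[0.0, 0.0], [0.5, 0.0], [0.0, -0.5]]
latents = [[0.6, 0.05], [0.01, 0.0], [0.0, -0.7]]
indices, quantized = quantize(latents, codebook)
print(indices)
```

In this sketch the list of indices plays the role of the discrete tokens that make up an action sentence. A trained model would learn the encoder, decoder, and codebook jointly, which the toy example does not attempt.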