CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan; Wenzheng Zeng; Yang Xiao; Xingyu Tong; Bo Tan; Zhiwen Fang; Zhiguo Cao; Joey Tianyi Zhou

TL;DR

To mitigate the asymmetry between the training and inference phases, this work designs a dual-branch architecture that lets the model perform novel-class inference without any text input, so the additional inference cost is negligible compared with the base skeleton encoder.

Abstract

Most existing one-shot skeleton-based action recognition methods focus on raw low-level information (e.g., joint locations) and may suffer from local information loss and low generalization ability. To alleviate these issues, we propose to leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. In particular, during training, we design two prompts to obtain global and local text descriptions of each action from an LLM. We first use the global text description to guide the skeleton encoder to focus on informative joints (i.e., global-to-local). Then we build non-local interaction between local text and joint features to form the final global representation (i.e., local-to-global). To mitigate the asymmetry between the training and inference phases, we further design a dual-branch architecture that allows the model to perform novel-class inference without any text input, which also makes the additional inference cost negligible compared with the base skeleton encoder. Extensive experiments on three different benchmarks show that CrossGLG consistently outperforms existing SOTA methods by large margins, while its inference cost (model size) is only 2.8% of that of the previous SOTA. CrossGLG can also serve as a plug-and-play module that substantially enhances the performance of different SOTA skeleton encoders at a negligible cost during inference. The source code will be released soon.
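The global-local-global flow described in the abstract can be illustrated with a small NumPy sketch. This is only a toy approximation under stated assumptions: the function names, feature sizes, and the use of dot-product similarity and softmax attention are illustrative choices, not the authors' implementation. In particular, the paper's joint importance is produced by a dedicated module from skeleton features, whereas this sketch scores joints directly against a global text embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_to_local(joint_feats, global_text):
    """Assumed global-to-local step: weight joints by similarity to the global text."""
    # joint_feats: (J, D) per-joint features; global_text: (D,) text embedding.
    scores = joint_feats @ global_text / np.sqrt(joint_feats.shape[1])
    importance = softmax(scores)                    # (J,), sums to 1
    return joint_feats * importance[:, None], importance

def local_to_global(joint_feats, local_text):
    """Assumed local-to-global step: every joint attends to every local text token."""
    # local_text: (T, D) embeddings of local (body-part level) descriptions.
    logits = joint_feats @ local_text.T / np.sqrt(joint_feats.shape[1])
    attn = softmax(logits, axis=-1)                 # (J, T)
    fused = joint_feats + attn @ local_text         # (J, D), residual fusion
    return fused.mean(axis=0)                       # (D,) global action representation

# Toy inputs; all sizes are arbitrary placeholders.
rng = np.random.default_rng(0)
J, T, D = 25, 6, 16
joints = rng.normal(size=(J, D))
g_text = rng.normal(size=D)
l_text = rng.normal(size=(T, D))

weighted, importance = global_to_local(joints, g_text)
action_repr = local_to_global(weighted, l_text)
```

Here `importance` plays the role of the per-joint weights, and `action_repr` is the pooled representation that a classifier or matching step would consume.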

Paper Structure (23 sections, 7 equations, 7 figures, 8 tables)

Introduction
Related Work
Method
Preliminary
Architecture Overview
Derive Knowledgeable Action Descriptions from Large Language Model (LLM)
Cross-Modal Global-Local-Global Guidance
Global-to-Local Textual Guidance
Local-to-Global Cross-Modal Interaction
Dual-Branch Architecture
Overall Learning Scheme
Experiment
Datasets and Settings
Comparison with state-of-the-art methods
Ablation Studies
...and 8 more sections

Figures (7)

Figure 1: (a) Main idea of the proposed CrossGLG: we leverage text descriptions generated by large language models (LLMs), which contain high-level human knowledge, to guide feature learning in a global-local-global way. In global-to-local (green block), the larger the radius of the circle around a joint, the more important that joint is. In local-to-global (blue block), non-local interaction establishes connections between all textual features and all skeleton features at the joint level to summarize the high-level global action representation. (b) Performance comparison on the NTU RGB+D 120 dataset with 20 and 100 base classes: CrossGLG can serve as a plug-and-play module that substantially enhances the performance of different SOTA skeleton encoders at a negligible cost during inference.
Figure 2: The overall model architecture. JID denotes the Joint Importance Discrimination module that outputs the importance of each joint based on the skeleton features. We design a cross-modal guidance branch (colored with green) to guide the skeleton feature learning. During novel class inference, only the skeleton encoding branch (colored with blue) will be used without any textual input.
Figure 3: The paradigm for acquiring knowledgeable action description texts through a large language model (ChatGPT is used in our implementation).
Figure 4: Visualization of the output joint importance from JID. The upper half shows the distribution of ground-truth key joints extracted from the global action description. The bottom half shows the output of the Joint Importance Discrimination (JID) module.
Figure 5: Visual analysis of spatial attention within the skeleton encoder. In each matrix, each row represents the importance of all encoding blocks for each joint feature in an action. The deeper the color, the higher the importance.
...and 2 more figures
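Figure 2 notes that novel-class inference uses only the skeleton encoding branch, with no textual input. A minimal sketch of what such text-free one-shot inference could look like is given below. The toy encoder, the random projection `W`, the class labels, and the nearest-exemplar cosine matching rule are all assumptions made for illustration; the source does not specify the matching procedure.

```python
import numpy as np

def encode_skeleton(seq, W):
    """Hypothetical stand-in for the skeleton branch: pool over frames, then project."""
    # seq: (frames, joints, 3) joint coordinates; W: (D, joints * 3) projection.
    pooled = seq.mean(axis=0).reshape(-1)
    z = W @ pooled
    return z / np.linalg.norm(z)                    # unit-length embedding

def one_shot_predict(query, exemplars, W):
    """Assign the query to the novel class whose single exemplar is most similar."""
    q = encode_skeleton(query, W)
    sims = {label: float(q @ encode_skeleton(ex, W)) for label, ex in exemplars.items()}
    return max(sims, key=sims.get), sims

# Toy data: one exemplar per novel class; no text is used anywhere at inference.
rng = np.random.default_rng(0)
frames, joints, dim = 30, 25, 32
W = rng.normal(size=(dim, joints * 3))
exemplars = {
    "wave": rng.normal(size=(frames, joints, 3)),
    "kick": rng.normal(size=(frames, joints, 3)),
}
# The query is a slightly perturbed copy of the "wave" exemplar.
query = exemplars["wave"] + 0.01 * rng.normal(size=(frames, joints, 3))

pred, sims = one_shot_predict(query, exemplars, W)
```

Because the text-guided branch is needed only during training in this setup, the inference path above touches nothing but the skeleton encoder, which is consistent with the negligible extra inference cost described in the abstract.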
