POET: Prompt Offset Tuning for Continual Human Action Adaptation

Prachi Garg, Joseph K J, Vineeth N Balasubramanian, Necati Cihan Camgoz, Chengde Wan, Kenrick Kin, Weiguang Si, Shugao Ma, Fernando De La Torre

TL;DR

POET addresses privacy-aware, few-shot continual learning for skeleton-based human action recognition (HAR) on XR devices by learning spatio-temporal prompt offsets while keeping the backbone frozen and storing no past data. The method introduces a shared prompt pool, input-conditioned prompt selection, and coupled optimization that jointly updates the prompts, keys, and query adaptor, enabling efficient adaptation to new action classes with minimal data. Empirical results on NTU RGB+D and SHREC-2017 show that POET achieves state-of-the-art stability-plasticity trade-offs (high $A_{HM}$) compared to prompt-based and traditional continual learning baselines, with a data-free paradigm that reduces memory and privacy risks. The approach demonstrates strong forward transfer and robust performance across graph CNN and graph transformer backbones, suggesting practical utility for deploying privacy-preserving, personalized XR action recognition systems.

Abstract

As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, and this process should not require storing or replaying any of the user's sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities, they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) NTU RGB+D dataset for activity recognition, and (ii) SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms comprehensive benchmarks. Source code is available at https://github.com/humansensinglab/POET-continual-action-recognition.

Paper Structure

This paper contains 29 sections, 10 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Proposed POET method continually adapts skeleton-based human action recognition models pretrained on a pre-defined set of categories to new user categories with few training examples. Users can thus expand the capabilities of XR systems with novel action classes by providing a few examples of each new class. We discard the user-sensitive data as soon as the model is updated on the new categories.
  • Figure 1: Empirical analysis to study the impact of the layer at which our prompt is attached. Y-axis shows accuracy on 'Old' and 'New' classes after Task 4 (after learning all 60 classes). We add a prompt of size $P_{T'} \in \mathbb{R}^{64 \times 25 \times 64}$ to different layers $\{1, 2, 3, 4\}$ of CTR-GCN, evaluated on the NTU RGB+D validation set. We select layer L1 due to its high performance on new classes.
  • Figure 2: POET: Prompt-offset Tuning proposes to offset the input feature embedding $\mathbf{X_e}$ of the main model by learnable prompt parameters $\mathbf{P_{T}}$ for privacy-aware few-shot continual action recognition. We explain the prompt selection mechanism in Fig. 3.
  • Figure 2: Effect of variation in number of few-shot samples used for training in user sessions $\mathcal{US}^{(1)}$-$\mathcal{US}^{(4)}$ on stability-plasticity trade-offs in our few-shot continual setting.
  • Figure 3: Selection of our prompts $\mathbf{P_{T}}$: Input-dependent query $\boldsymbol{q}$ is matched with keys $\boldsymbol{K}$ using sorted cosine similarity to get an ordered index sequence $(s_i)_{i=1}^{T}$ of the top $T$ keys. This ordered index sequence is used to select the corresponding ordered prompt sequence $\mathbf{P_{T}}$ from prompt pool $\mathbf{P}$. We add $\mathbf{P_T}$ to $\mathbf{X_e}$, thereby adding an offset to it. Our experimental evaluation confirms that such an additive spatio-temporal prompt offset can balance the plasticity to learn new classes from a few action samples, while maintaining stability on previously learned classes.
  • ...and 11 more figures
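
The prompt selection described in the Figure 3 caption can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names are hypothetical, the query $\boldsymbol{q}$ is assumed to be already computed (the paper derives it from the input through a query adaptor), and each prompt in the pool is simplified to a single per-frame feature vector so that the $T$ selected prompts line up with $T$ frames of $\mathbf{X_e}$.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors


def select_prompt_indices(query, keys, top_t):
    """Return the ordered indices (s_1, ..., s_T) of the top-T most similar keys."""
    sims = [cosine_similarity(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: sims[i], reverse=True)
    return order[:top_t]


def apply_prompt_offset(x_e, prompt_pool, indices):
    """Add the selected prompts to the embedding, one prompt per frame.

    Assumption for this sketch: x_e is a list of T frames, each a list of
    D features, and every prompt in prompt_pool is a list of D features.
    """
    if len(indices) != len(x_e):
        raise ValueError("need exactly one selected prompt per frame")
    return [
        [x + p for x, p in zip(frame, prompt_pool[idx])]
        for frame, idx in zip(x_e, indices)
    ]


# Toy example: a pool of 4 prompts with matching keys, T = 3 frames, D = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
prompt_pool = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0], [-10.0, 0.0]]
query = [0.9, 0.1]
x_e = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

indices = select_prompt_indices(query, keys, top_t=3)
x_offset = apply_prompt_offset(x_e, prompt_pool, indices)
print(indices)   # [0, 2, 1]
print(x_offset)  # [[11.0, 1.0], [7.0, 7.0], [3.0, 13.0]]
```

Because the indices are kept in similarity order rather than as an unordered set, the same pool can yield different prompt sequences for different inputs, which mirrors the "ordered prompt sequence" wording of the caption. The backbone is untouched in this sketch; only the additive offset changes what it sees.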