Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Matteo Mosconi; Andriy Sorokin; Aniello Panariello; Angelo Porrello; Jacopo Bonato; Marco Cotogni; Luigi Sabetta; Simone Calderara; Rita Cucchiara

TL;DR

This work tackles continual learning for skeleton-based human action recognition (HAR), addressing catastrophic forgetting in class-incremental settings. It introduces CHARON, a memory-efficient framework with three parts. First, it compresses the skeletal samples stored in the replay buffer by uniformly sampling frames with interval $s$, and reconstructs them by linear interpolation. Second, it trains a masked encoder–decoder (inspired by masked autoencoders) to jointly optimize recognition and reconstruction. Third, a lightweight linear probing phase aligns the classifier. The approach uses an STTFormer backbone and a memory-replay objective that combines reconstruction and logits/labels losses, achieving state-of-the-art results on the Split NTU-60 and Split NTU-120 skeleton datasets while reducing memory and compute. These contributions make online HAR more practical by sustaining high performance under tight memory budgets and across different masking settings, with more aggressive masking regimes left as a possible future extension.

Abstract

The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at https://github.com/Sperimental3/CHARON.
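
The buffer compression described above (uniform sampling with interval $s$, then linear interpolation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `compress` and `reconstruct`, the `(frames, joints, channels)` layout, and the choice to hold the last kept frame for any trailing frames are all assumptions made for illustration.

```python
import numpy as np


def compress(seq: np.ndarray, s: int) -> np.ndarray:
    """Keep every s-th frame of a (T, J, C) skeleton sequence (hypothetical helper)."""
    return seq[::s]


def reconstruct(kept: np.ndarray, s: int, length: int) -> np.ndarray:
    """Linearly interpolate kept frames back to `length` frames (hypothetical helper).

    Frames after the last kept frame are held constant, which is an
    assumption of this sketch rather than something stated in the source.
    """
    kept_idx = np.arange(kept.shape[0]) * s      # original time positions of kept frames
    target_idx = np.arange(length)               # every frame position to recover
    flat = kept.reshape(kept.shape[0], -1)       # (kept frames, J * C)
    # np.interp is one-dimensional, so interpolate each coordinate separately.
    cols = [np.interp(target_idx, kept_idx, flat[:, k]) for k in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape((length,) + kept.shape[1:])


# Toy sequence: 9 frames, 25 joints, 3 coordinates, moving linearly in time.
seq = np.arange(9, dtype=float)[:, None, None] * np.ones((1, 25, 3))
kept = compress(seq, s=4)                        # stores frames 0, 4 and 8 only
restored = reconstruct(kept, s=4, length=seq.shape[0])
```

With $s = 4$ the buffer stores 3 of the 9 frames in this toy case. Because the toy motion is linear in time, interpolation recovers it exactly; real motion is recovered only approximately, with an error that depends on $s$.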

Paper Structure (15 sections, 10 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Related works
Method
Preliminaries
CHARON
Experimental analysis
Datasets
Implementation details
Results
Ablations
Conclusions
Acknowledgements
Split NTU-120
Skeleton-based MAE
Interpolation reconstruction error

Figures (5)

Figure 1: Key components of CHARON. The efficient buffer strategy is shown on the left $(a)$. The upper right $(b)$ shows the training phase with reconstruction regularization, while linear probing is displayed at the bottom $(c)$. Best viewed in color.
Figure 2: Epochs per hour at different masking ratio values.
Figure 3: Contribution of linear probing to joint training at varying masking ratios.
Figure 4: (left) FAA for the DER++ baseline employing different values of the sampling interval $s$. (right) FAA obtained by CHARON as the masking ratio varies.
Figure A: Reconstruction error of the interpolation procedure as the sampling interval varies.
