Decoupled Prompt-Adapter Tuning for Continual Activity Recognition
Di Fu, Thanh Vinh Vo, Haozhe Ma, Tze-Yun Leong
TL;DR
DPAT addresses continual action recognition by decoupling prompt tuning from adapter tuning within a frozen Vision Transformer backbone. It combines temporal and spatial adapters with learnable prompts in a two-stage training regime: the first stage establishes generalization via prefix tuning, and the second specializes the model with adapters while preserving the learned prompts. A redesigned, softmax-normalized query-key matching loss improves task-specific key selection, raising accuracy and retention of earlier tasks without replaying stored data. Experiments on Kinetics-400, ActivityNet, and EPIC-Kitchens-100 show state-of-the-art performance and reduced forgetting, which makes the approach practical for memory-efficient continual video understanding in real-world settings.
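To make the idea of a softmax-normalized query-key matching loss concrete, here is a minimal sketch. The summary above does not give the loss in closed form, so everything below is an assumption for illustration: the cosine similarity, the `temperature` parameter, the cross-entropy form, and the function names `query_key_matching_loss` and `select_key` are hypothetical and are not taken from the DPAT implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_key_matching_loss(query, keys, task_id, temperature=1.0):
    """Assumed form: cross-entropy over softmax-normalized query-key similarities.

    query   : feature vector, e.g. from a frozen backbone (assumption)
    keys    : one learnable key vector per task (assumption)
    task_id : index of the key that should match this query during training
    """
    sims = [cosine_similarity(query, k) / temperature for k in keys]
    probs = softmax(sims)
    # Minimizing this raises the matching key's similarity relative to all others.
    return -math.log(probs[task_id])


def select_key(query, keys):
    """At inference, pick the index of the key most similar to the query."""
    sims = [cosine_similarity(query, k) for k in keys]
    return max(range(len(keys)), key=lambda i: sims[i])


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    print(query_key_matching_loss(query, keys, task_id=0))
    print(select_key(query, keys))
```

Under these assumptions, normalizing over all keys means the loss depends on how the correct key ranks against the others, not only on its absolute similarity to the query, which is one plausible reading of how softmax normalization sharpens task-specific key selection.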
Abstract
Action recognition plays a vital role in surveillance-based security, patient monitoring in healthcare, performance analysis in sports, and human-AI collaboration in domains such as manufacturing and assistive technologies. Because data in these areas changes continually, models must adapt to new video data without losing previously acquired knowledge, which is the goal of continual action recognition. To address this challenge, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatio-temporal information with learnable prompts for mitigating catastrophic forgetting, trained through a decoupled strategy. DPAT balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, maintaining performance as data evolves without extensive fine-tuning. DPAT consistently achieves state-of-the-art performance across several challenging continual action recognition benchmarks.
