Decoupled Prompt-Adapter Tuning for Continual Activity Recognition

Di Fu, Thanh Vinh Vo, Haozhe Ma, Tze-Yun Leong

TL;DR

DPAT addresses continual action recognition by decoupling prompt tuning from adapter tuning within a frozen Vision Transformer backbone. It combines temporal and spatial adapters with learnable prompts in a two-stage training regime: the first stage establishes generalization via prefix tuning, and the second specializes the model with adapters while preserving the prompts. A redesigned, softmax-normalized query-key matching loss improves task-specific key selection, which raises accuracy and reduces forgetting without replaying stored data. Experiments on Kinetics-400, ActivityNet, and EPIC-Kitchens-100 show state-of-the-art performance and reduced forgetting, indicating the approach's practical value for memory-efficient continual video understanding in real-world settings.
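
To make the idea of a softmax-normalized query-key matching loss concrete, the following is a minimal sketch of one plausible reading, not the paper's actual formulation, which this summary does not give. It assumes that each task owns one learnable key, that a query feature is compared to all keys by cosine similarity, and that the loss is the negative log of the softmax probability assigned to the correct task's key. The names `cosine`, `softmax`, `matching_loss`, `select_key`, and the `temperature` parameter are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matching_loss(query, keys, task_id, temperature=1.0):
    """Assumed form of the loss: cross-entropy over softmax-normalized
    query-key similarities, with the current task's key as the target."""
    sims = [cosine(query, k) / temperature for k in keys]
    probs = softmax(sims)
    return -math.log(probs[task_id])

def select_key(query, keys):
    """At inference, pick the index of the key most similar to the query."""
    sims = [cosine(query, k) for k in keys]
    return max(range(len(keys)), key=sims.__getitem__)

# Toy example: two task keys and a query aligned with the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
print(select_key(query, keys))
print(matching_loss(query, keys, task_id=0) < matching_loss(query, keys, task_id=1))
```

Under these assumptions, the softmax normalization makes the keys compete: pulling the query toward the correct key also pushes it away from the keys of other tasks. An unnormalized similarity loss would not enforce that by itself.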

Abstract

Action recognition technology plays a vital role in enhancing security through surveillance systems, enabling better patient monitoring in healthcare, providing in-depth performance analysis in sports, and facilitating seamless human-AI collaboration in domains such as manufacturing and assistive technologies. The dynamic nature of data in these areas underscores the need for models that can continuously adapt to new video data without losing previously acquired knowledge, highlighting the critical role of advanced continual action recognition. To address these challenges, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatial-temporal information and learnable prompts for mitigating catastrophic forgetting through a decoupled training strategy. DPAT uniquely balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, effectively addressing the challenge of maintaining model performance amidst continuous data evolution without necessitating extensive finetuning. DPAT consistently achieves state-of-the-art performance across several challenging action recognition benchmarks, thus demonstrating the effectiveness of our model in the domain of continual action recognition.

Paper Structure

This paper contains 23 sections, 5 equations, 2 figures, 8 tables, 2 algorithms.

Figures (2)

  • Figure 1: Overview of the proposed Decoupled Prompt-Adapter Tuning (DPAT) approach: (a) model architecture integrating adapters and prefix prompts to facilitate adaptation to new tasks; (b) decoupled training paradigm designed to bolster knowledge preservation through phase-separated optimization of the model components.
  • Figure 2: Comparative results of DPAT with joint and decoupled training strategies on Kinetics-400.