Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos
TL;DR
This work tackles the practical problem of adapting egocentric video foundation models with minimal compute. It introduces Ego-VPA, a prompt-tuning framework that learns a shared, orthogonal basis of prompts and reconstructs per-frame prompts via sparse subspace projection. Because the basis is shared, the method enables context fusion across frames and cross-modal transfer between video and text while training only 0.84% of the parameters. Cross-modal prompt synthesis further uses the common basis to align visual and textual representations, improving video-language grounding beyond existing prompt-tuning baselines and rivaling full fine-tuning on several datasets. Results on Charades-Ego, EGTEA, and EPIC-Kitchens-100 show strong performance gains, data efficiency, and generalization to retrieval tasks.
Abstract
Video understanding typically requires fine-tuning a large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose Ego-VPA, a parameter-efficient adaptation method for egocentric video tasks. Ego-VPA computes a local sparse approximation of each video frame or text feature using a set of basis prompts, and the selected basis prompts are used to synthesize video and text prompts. Since the basis prompts are shared across frames and modalities, Ego-VPA models context fusion and cross-modal transfer efficiently. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), substantially improving over baselines and reaching the performance of full fine-tuning.
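The sparse approximation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the top-k selection rule, the random (rather than learned) basis, and the choice to form the prompt as the weighted sum of the selected basis prompts are all assumptions made for illustration.

```python
import numpy as np

def make_orthonormal_basis(num_basis, dim, seed=0):
    """Hypothetical stand-in for the learned basis prompts.

    Returns an array of shape (num_basis, dim) with orthonormal rows.
    Requires num_basis <= dim.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_basis)))
    return q.T

def synthesize_prompt(feature, basis, k):
    """Assumed rule: keep the k basis prompts with the largest projection
    magnitude and combine them, weighted by their coefficients."""
    coeffs = basis @ feature                 # one coefficient per basis prompt
    top = np.argsort(-np.abs(coeffs))[:k]    # indices of the k selected prompts
    prompt = coeffs[top] @ basis[top]        # sparse approximation of the feature
    return prompt, top

# The same basis serves both modalities (features here are random placeholders).
dim, num_basis, k = 16, 8, 3
basis = make_orthonormal_basis(num_basis, dim)
rng = np.random.default_rng(1)
frame_features = rng.standard_normal((4, dim))   # four video frames
text_feature = rng.standard_normal(dim)          # one text embedding

frame_prompts = np.stack(
    [synthesize_prompt(f, basis, k)[0] for f in frame_features]
)
text_prompt, text_selected = synthesize_prompt(text_feature, basis, k)
```

Because every frame and the text feature draw from one basis, the only prompt parameters in this sketch are the `num_basis * dim` entries of `basis`, which is the intuition behind sharing prompts across frames and modalities.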
