Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

Tz-Ying Wu; Kyle Min; Subarna Tripathi; Nuno Vasconcelos

TL;DR

This work tackles the practical problem of adapting egocentric video foundation models with minimal compute. It introduces Ego-VPA, a prompt-tuning framework that learns a shared, orthogonal basis of prompts and reconstructs per-frame prompts via sparse subspace projection. This design enables context fusion across frames and cross-modal transfer between video and text while training only $0.84\%$ of the parameters. Cross-modal prompt synthesis further uses the common basis to align visual and textual representations, improving video-language grounding over existing prompt-tuning baselines and rivaling full fine-tuning on several datasets. Results on Charades-Ego, EGTEA, and EPIC-Kitchens-100 show strong performance gains, data efficiency, and generalization to retrieval tasks.

Abstract

Video understanding typically requires fine-tuning a large backbone when adapting to new domains. In this paper, we leverage egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose Ego-VPA, a parameter-efficient adaptation method for egocentric video tasks. Ego-VPA computes a local sparse approximation of each video frame/text feature using a set of basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, the method models context fusion and cross-modal transfer efficiently. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), substantially improving over baselines and reaching the performance of full fine-tuning.

Paper Structure

This paper contains 16 sections, 14 equations, 7 figures, and 4 tables.

Introduction
Related Works
Egocentric Video Understanding with VFMs
VFM Preliminaries
Generalization Ability of VFMs for Egocentric Video
Ego-VFM Prompt-tuning Baselines
Ego-VPA
Video Prompt Synthesis
Cross-modal Prompt Synthesis
Training
Experiments
Experimental Setup
Comparisons to SOTA Prompt-tuning Methods
Ablation Studies
Generalization to Retrieval Tasks
...and 1 more section

Figures (7)

Figure 1: Ego-VPA leverages context-aware prompts to achieve parameter-efficient adaptation for egocentric videos. (Left) Performance vs. tunable parameters; (Right) Cross-modal prompt-tuning in Ego-VPA, where the VFM is frozen.
Figure 1: The gap between zero-shot and fine-tuned performance (mAP on Charades-Ego) exists in both CLIP-based VFMs [ni2022expanding, wasim2023vita] and Ego-VFMs [zhao2023lavila]. $^{\star}$ denotes that only the prompts/adapters are fine-tuned on Charades-Ego.
Figure 2: Models. We adapt SOTA prompt-tuning methods to Ego-VFMs (see Section \ref{sec:settings}), i.e., TPT, VPT, and VoPF+C, where CMM is a context modeling module. The proposed Ego-VPA leverages a set of basis prompts ${\cal F}$ for cross-modal prompt synthesis, enabling context modeling across frames and modalities in a highly efficient way (see Section \ref{sec:method}).
Figure 3: Prompt Synthesis. Token ${\bf z}$ is projected into the subspace by $h(\cdot)$, and sparsely approximated by the top-$k$ similar prompts in the prompt basis ${\cal F}$, which are finally mapped into $k$ prompts by the mapping $g(\cdot)$. $(h,g)$ can be $(h_{vid},g_{vid})$ or $(h_{txt},g_{txt})$ for visual or text prompt generation, respectively. $k=4$ in this illustration.
Figure 4: Cross-modal Prompt Synthesis. The basis prompts $\cal F$ are shared across frames and modalities, but different mapping functions $h,g$ are adopted per modality to synthesize the prompts.
...and 2 more figures
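
The prompt synthesis described in the Figure 3 and Figure 4 captions can be illustrated with a small sketch. This is not the authors' implementation: it assumes that $h$ and $g$ are plain linear maps, that similarity to the basis is cosine similarity, and that each selected basis prompt is mapped independently by $g$. The names `synthesize_prompts`, `matvec`, `cosine`, and `random_matrix`, and all dimensions, are hypothetical and chosen only for illustration.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-12)

def synthesize_prompts(z, basis, W_h, W_g, k=4):
    """Select the top-k basis prompts for token z and map them to k prompts.

    W_h and W_g stand in for h(.) and g(.); treating them as linear maps
    is an assumption, since the captions do not specify their form.
    """
    q = matvec(W_h, z)                      # h(z): project the token into the basis subspace
    scores = [cosine(q, f) for f in basis]  # similarity to every basis prompt
    top = sorted(range(len(basis)), key=lambda i: scores[i], reverse=True)[:k]
    prompts = [matvec(W_g, basis[i]) for i in top]  # g(.): map selected basis prompts
    return top, prompts

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_sub, n_basis, k = 8, 4, 10, 4    # toy sizes, chosen arbitrarily

# One basis shared across frames and modalities (F in the captions).
basis = random_matrix(n_basis, d_sub, rng)

# Separate mappings per modality: (h_vid, g_vid) and (h_txt, g_txt).
h_vid, g_vid = random_matrix(d_sub, d_model, rng), random_matrix(d_model, d_sub, rng)
h_txt, g_txt = random_matrix(d_sub, d_model, rng), random_matrix(d_model, d_sub, rng)

frame_token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
text_token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

vid_idx, vid_prompts = synthesize_prompts(frame_token, basis, h_vid, g_vid, k)
txt_idx, txt_prompts = synthesize_prompts(text_token, basis, h_txt, g_txt, k)
```

Both calls draw from the same `basis`, so the only modality-specific parameters in this sketch are the two pairs of mapping matrices. Under the stated assumptions, this is what keeps the number of trainable parameters small relative to the frozen backbone.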
