Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting

Chen Cai; Zheng Wang; Jianjun Gao; Wenyang Liu; Ye Lu; Runzhong Zhang; Kim-Hui Yap

TL;DR

The paper tackles continual VideoQA, where fine-tuning an LLM across sequential tasks causes catastrophic forgetting. It introduces Collaborative Prompting (ColPro), which integrates three prompting strategies: task-specific question constraint prompting (TQCP), knowledge acquisition prompting (KAP), and visual temporal awareness prompting (VTAP). Specialized prompts $\mathbf{P}_e$ and $\mathbf{P}_g$ are attached to the initial layers of an LLM (e.g., LLaMA) to encode textual, visual, and temporal information. Within a rehearsal-free continual learning framework, ColPro achieves state-of-the-art accuracy on NExT-QA (55.14%) and DramaQA (71.24%) while reducing average forgetting. The method is parameter-efficient, trains with targeted losses ($\mathcal{L}_q$, $\mathcal{L}_v$, $\mathcal{L}_a$), and uses multimodal prompts to transfer knowledge across tasks, which makes it suited to adaptive VideoQA on dynamic, real-world video streams. Limitations include residual forgetting on DramaQA at the tested model size and compute constraints that prevented exploring larger LLMs.

Abstract

In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA) models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro), which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro outperforms existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and effectiveness.
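The two quantities reported above, final accuracy and forgetting, can be illustrated with a minimal sketch. It assumes one common continual-learning definition of forgetting (the drop from a task's best earlier accuracy to its accuracy after the last task); the paper's exact formula is not given in this summary. The function names and the toy accuracy values are hypothetical and are not results from the paper.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks, measured after training on the final task.

    acc[t][i] is the accuracy on task i after training on tasks 0..t,
    so row t has t + 1 entries (assumed layout, for illustration only).
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best accuracy to its final accuracy.

    The last task is excluded because it cannot have been forgotten yet.
    """
    n = len(acc)
    if n < 2:
        return 0.0
    drops = []
    for i in range(n - 1):
        # Best accuracy on task i at any point before the final task.
        best = max(acc[t][i] for t in range(i, n - 1))
        drops.append(best - acc[n - 1][i])
    return sum(drops) / len(drops)


# Hypothetical toy values for three sequential tasks (not from the paper).
toy_acc = [
    [80.0],
    [70.0, 60.0],
    [65.0, 55.0, 50.0],
]
avg_acc = average_accuracy(toy_acc)
avg_fgt = average_forgetting(toy_acc)
```

A rehearsal-free method aims to keep the second number low without storing examples from earlier tasks, while keeping the first number high.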

Paper Structure (21 sections, 11 equations, 4 figures, 9 tables)

Introduction
Related Works
Video Question Answering
Continual Learning for Visual Question Answering
Methodology
Motivation and Objective
Collaborative Prompting
Experiments
Datasets
Evaluation Metrics
Implementation Details
Comparison with Continual Learning Methods
Comparison with Parameter-Efficient Fine Tuning Methods
Task-by-Task Average Accuracy
Ablation Study
...and 6 more sections

Figures (4)

Figure 1: (a) Existing fine-tuning techniques train for different VideoQA tasks, which could lead to catastrophic forgetting, and generate inferior results. (b) We introduce the Collaborative Prompting (ColPro) within a continual learning framework, which retains task-specific knowledge to generate accurate answers, where $\mathbf{P}_N$ denotes a projection layer.
Figure 2: Illustration of the Collaborative Prompting (ColPro) framework. Left: The training process incorporates ColPro into the first $j$ ColPro Guided Pre-trained Layers to enhance answer prediction accuracy while minimizing forgetting. Right: Three detailed prompting techniques within ColPro are demonstrated: task-specific question constraint prompting (TQCP), knowledge acquisition prompting (KAP), and visual temporal awareness prompting (VTAP). Together, these techniques encapsulate the textual question context, visual content, and video temporal dynamics for each VideoQA task.
Figure 3: Average accuracy for each task, following the training order within the CL setting.
Figure 4: The video examples with their corresponding questions and answers for each task.
