Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing
Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen
TL;DR
This paper defines and tackles Continual Text-to-Video Retrieval (CTVR), a setting in which a text-to-video retrieval (TVR) system built on a pre-trained model (PTM) must continually adapt to new video content while preserving performance on prior tasks. It introduces FrameFusionMoE, a parameter-efficient framework with two components: the Frame Fusion Adapter (FFA), which captures temporal video dynamics without eroding the CLIP embedding space, and the Task-Aware Mixture-of-Experts (TAME), which routes text queries to task-specific experts and keeps them aligned with cached video features. The method is trained with video-to-text (v2t) and text-to-video (t2v) retrieval losses, plus a Cross-Task loss that regularizes representations against historical videos, and achieves near-zero backward forgetting across multiple benchmarks. Experiments on MSRVTT and ActivityNet show superior retrieval performance and robustness to task ordering, with substantial efficiency gains over parallel continual learning (CL) baselines, which makes the approach practical for dynamic video retrieval systems.
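To make the v2t and t2v losses mentioned above concrete, the following is a minimal sketch of a symmetric contrastive retrieval loss of the kind commonly used with CLIP-style models. It is an assumption, not the paper's exact formulation: the function names, the temperature value of 0.07, and the batch layout (row i of the text batch matches row i of the video batch) are all illustrative, and the Cross-Task loss is not shown.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_retrieval_loss(text_emb, video_emb, temperature=0.07):
    """Return (t2v_loss, v2t_loss) for a batch of paired embeddings.

    Assumption: text_emb[i] and video_emb[i] form the matching pair;
    every other combination in the batch is treated as a negative.
    """
    texts = [normalize(t) for t in text_emb]
    videos = [normalize(v) for v in video_emb]
    n = len(texts)

    # sim[i][j] = scaled cosine similarity between text i and video j.
    sim = [
        [sum(a * b for a, b in zip(texts[i], videos[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # t2v: each text query should rank its own video first (softmax over rows).
    t2v = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # v2t: each video should rank its own text first (softmax over columns).
    sim_t = [list(col) for col in zip(*sim)]
    v2t = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return t2v, v2t
```

In a sketch like this, the two terms would typically be summed or averaged into a single training objective; correctly paired batches yield a loss near zero, while mismatched pairs are penalized heavily.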
Abstract
Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
