Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing

Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen

TL;DR

This paper defines and tackles Continual Text-to-Video Retrieval (CTVR), a setting where a PTM-based TVR system must continually adapt to new video content while preserving performance on prior tasks. It introduces FrameFusionMoE, a parameter-efficient framework with two novel components: Frame Fusion Adapter (FFA) to capture temporal video dynamics without eroding the CLIP embedding space, and Task-Aware Mixture-of-Experts (TAME) to route text queries to task-specific experts and maintain alignment with cached video features. The method optimizes cross-modal retrieval with v2t and t2v losses and adds a Cross-Task loss to regularize representations against historical videos, achieving near-zero backward forgetting across multiple benchmarks. Experiments on MSRVTT and ActivityNet demonstrate superior retrieval performance and robustness to task sequence, with substantial efficiency gains over parallel CL baselines, highlighting practical impact for dynamic video retrieval systems.

Abstract

Text-to-Video Retrieval (TVR) aims to retrieve relevant videos based on textual queries. However, as video content evolves continuously, adapting TVR systems to new data remains a critical yet under-explored challenge. In this paper, we introduce the first benchmark for Continual Text-to-Video Retrieval (CTVR) to address the limitations of existing approaches. Current Pre-Trained Model (PTM)-based TVR methods struggle with maintaining model plasticity when adapting to new tasks, while existing Continual Learning (CL) methods suffer from catastrophic forgetting, leading to semantic misalignment between historical queries and stored video features. To address these two challenges, we propose FrameFusionMoE, a novel CTVR framework that comprises two key components: (1) the Frame Fusion Adapter (FFA), which captures temporal video dynamics while preserving model plasticity, and (2) the Task-Aware Mixture-of-Experts (TAME), which ensures consistent semantic alignment between queries across tasks and the stored video features. Thus, FrameFusionMoE enables effective adaptation to new video content while preserving historical text-video relevance to mitigate catastrophic forgetting. We comprehensively evaluate FrameFusionMoE on two benchmark datasets under various task settings. Results demonstrate that FrameFusionMoE outperforms existing CL and TVR methods, achieving superior retrieval performance with minimal degradation on earlier tasks when handling continuous video streams. Our code is available at: https://github.com/JasonCodeMaker/CTVR.
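The retrieval objectives named in the TL;DR (t2v and v2t losses plus a Cross-Task loss that uses historical videos) can be illustrated with a minimal sketch. It assumes a standard symmetric InfoNCE form over cosine similarities and treats cached video features purely as extra negatives for each text query. The function names, the temperature value of 0.07, and the exact form of the cross-task term are illustrative assumptions, not the authors' implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Fall back to 1.0 for an all-zero vector to avoid division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(logits, index):
    # Numerically stable log-softmax evaluated at one position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z


def symmetric_retrieval_loss(text_feats, video_feats, temperature=0.07):
    """Return (t2v, v2t) InfoNCE losses; pair i of texts and videos is the match."""
    texts = [l2_normalize(t) for t in text_feats]
    videos = [l2_normalize(v) for v in video_feats]
    n = len(texts)
    sim = [[dot(t, v) / temperature for v in videos] for t in texts]
    # t2v: each text picks its video among all videos in the batch (rows).
    t2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # v2t: each video picks its text among all texts in the batch (columns).
    v2t = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return t2v, v2t


def cross_task_loss(text_feats, video_feats, cached_video_feats, temperature=0.07):
    """Assumed form: matched video is the positive, cached videos are negatives."""
    texts = [l2_normalize(t) for t in text_feats]
    videos = [l2_normalize(v) for v in video_feats]
    cached = [l2_normalize(c) for c in cached_video_feats]
    total = 0.0
    for i, t in enumerate(texts):
        logits = [dot(t, videos[i]) / temperature]           # positive at index 0
        logits += [dot(t, c) / temperature for c in cached]  # historical negatives
        total -= log_softmax_at(logits, 0)
    return total / len(texts)
```

Under these assumptions, the cross-task term grows when a current query lands close to a cached video from an earlier task, which is the behaviour the regularizer is described as discouraging.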

Paper Structure

This paper contains 20 sections, 10 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: An illustration of Continual Text-to-Video Retrieval (CTVR) pipeline. A Pre-Trained Model (PTM) continuously adapts to a sequence of TVR tasks through continual learning. Video features extracted in the current task are stored in a database and leveraged for subsequent tasks. During inference, all task queries can retrieve relevant videos within the video feature database.
  • Figure 2: Visualization of model plasticity across sequential tasks (T), indexed chronologically. The first column, T0, denotes the initial state of the pre-trained model without any updates. The results show the performance change on previous tasks after training on the current task (green for an increase, red for a drop) relative to the CLIP zero-shot results on the MSRVTT dataset. (a) The state-of-the-art TVR method X-Pool (Gorti et al., 2022) exhibits declining plasticity on new tasks, i.e., it underperforms the zero-shot baseline on late-stage tasks. (b) Our approach consistently improves task-wise performance while maintaining low backward forgetting when adapting to new tasks.
  • Figure 3: The catastrophic forgetting problem, illustrated by text feature shifts on MSRVTT. Each $\bullet$ or $\blacksquare$ represents a query in Task 1 or Task 5, respectively. Different colors mark the states of Task 1 queries after each task update. Ideally, with no forgetting at all, each Task 1 query would not move in the embedding space after new tasks are learned. (a) LwF (Li and Hoiem, 2017), a strong CL baseline, shows query embeddings shifting from their original positions as the model keeps updating, as highlighted by the scattered colors $\Box$. (b) Our approach maintains stable features with minimal shifts across tasks, as evidenced by the overlap among different colors.
  • Figure 4: Overall framework of FrameFusionMoE. It consists of three core components: (a) A Task-Aware MoE Adapter (TAME) that is added to a frozen CLIP text encoder to learn the distribution of text queries through the selection of multiple experts $\{{\bm{B}}_i\}_{i=1}^{n_e}$. The expert weights ${\bm{w}}^\text{MoE}$ are determined by a router that takes the element-wise addition ($\oplus$) of the $[\text{EOS}]$ token and the task prototype ${\bm{p}}_t$ as input. (b) A vision processing pipeline in which frame features are processed through a frozen CLIP vision encoder and Frame Fusion Adapters (FFA). Each FFA uses the previous frame's feature maps ${\bm{F}}_{m-1}$ to attend over the current frame ${\bm{F}}_m$ through multi-head temporal cross attention. The FFA output serves as a temporal guidance signal that is added back to each spatial self-attention layer. (c) The Cross-Task Loss ($\mathcal{L}_{\text{CT}}$) optimizes representations by drawing matched text-video pairs closer while pushing away cached video features that serve as negative samples.
  • Figure 5: Visualization of attention maps from FFA Temporal CA and CLIP Spatial SA mechanisms. Brighter regions in the attention maps indicate higher attention weights. FFA's temporal CA demonstrates stronger attention weights on temporally consistent regions between frames (e.g., track surface, background) while showing lower attention on the changing sand pit area, effectively capturing inter-frame consistency. CLIP's spatial SA focuses on the athlete and their jumping action, capturing semantically important motion information within the frame.
  • ...and 3 more figures
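
The TAME routing step described in the Figure 4 caption (a router that takes the element-wise sum of the [EOS] token and the task prototype and produces expert weights) can be sketched as follows. This is a minimal sketch under stated assumptions: the router is taken to be a single linear layer, the weights come from a softmax over the top-k router logits, and each expert is a plain linear map. The names `route`, `mix_experts`, `router_weights`, and `top_k` are hypothetical and do not come from the paper or its code.

```python
import math


def matvec(matrix, vec):
    # Multiply a row-major matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def route(eos_token, task_prototype, router_weights, top_k=2):
    """Return one weight per expert; experts outside the top-k get weight 0."""
    # Router input: element-wise addition of the [EOS] feature and the task prototype.
    router_input = [e + p for e, p in zip(eos_token, task_prototype)]
    logits = matvec(router_weights, router_input)  # one logit per expert
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    kept_weights = softmax([logits[i] for i in keep])
    weights = [0.0] * len(logits)
    for i, w in zip(keep, kept_weights):
        weights[i] = w
    return weights


def mix_experts(hidden, experts, weights):
    """Weighted sum of expert outputs; each expert is a matrix applied to `hidden`."""
    out = [0.0] * len(experts[0])
    for expert, w in zip(experts, weights):
        if w == 0.0:
            continue  # skip experts the router did not select
        for j, y in enumerate(matvec(expert, hidden)):
            out[j] += w * y
    return out
```

Because the task prototype is added to the router input, the same query feature can be sent to different experts under different tasks, which is one way to read the "task-aware" part of the name.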