CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

Haiyang Guo, Yichen Shi, Fei Zhu, Wenzhuo Liu, Hongbo Zhao, Fanhu Zeng, Shijie Ma, Da-Han Wang, Xu-Yao Zhang

Abstract

Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored to continual video understanding in Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to evaluate whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.

Paper Structure

This paper contains 25 sections, 1 equation, 14 figures, and 17 tables.

Figures (14)

  • Figure 1: Comparison of CL benchmarks. (a) The significant performance gap between zero-shot and joint training indicates that these tasks require Video-LLMs to learn genuinely new knowledge. On NextQA [xiao2021next] and STAR [wu2024star], sequential fine-tuning (Seq-FT) shows negligible forgetting, nearly matching the joint-training upper bound. Conversely, our CL-VISTA benchmark maintains a realistic performance degradation under Seq-FT. (b) Backward Transfer (BWT) measures the impact of new learning on past tasks (see the reference definition sketched after this list). Negative BWT values reveal that CL-VISTA consistently induces genuine catastrophic forgetting, whereas existing benchmarks exhibit unrealistic positive BWT.
  • Figure 2: Overview of the CL-VISTA Benchmark: Architecture and Key Functionalities.
  • Figure 3: Embedding discriminability analysis. Compared to previous benchmarks where task embeddings are highly entangled, our proposed setting yields clearer task separation. Quantitative results (right) further confirm that our benchmark achieves significantly larger inter-task distances, facilitating better task-incremental evaluation.
  • Figure 4: Learning trajectory analysis. Compared to individual training (pink), CL-VISTA’s continual training (blue) exhibits distinct loss spikes at task boundaries. Existing benchmarks show smooth, homogeneous transitions, failing to challenge Video-LLMs with the clear distribution shifts necessary for evaluating continual learning.
  • Figure 5: Overview of the data reconstruction pipeline. The pipeline begins with the generation of sentence-level question-answer pairs derived from segmented video clips and their corresponding annotations. These candidates are subsequently refined through a multi-discriminator filtering mechanism to yield a final collection of high-quality and diverse QA pairs.
  • ...and 9 more figures
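
For reference, since this excerpt does not restate the Backward Transfer formula used in Figure 1, the sketch below gives the standard definition from the continual-learning literature; the notation (R_{j,i} for accuracy on task i measured after training through task j, and T for the number of tasks) and the helper function are our assumptions, not identifiers from the paper:

\[
\mathrm{BWT} \;=\; \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right)
\]

A minimal, runnable Python sketch under the same assumptions:

# Reference sketch (our notation, not the paper's): Backward Transfer (BWT).
# R[j][i] is accuracy on task i measured after sequentially training through
# task j (0-indexed), for T tasks in total.
def backward_transfer(R):
    T = len(R)
    # Average, over the first T-1 tasks, of the accuracy change between
    # "just after learning task i" and "after finishing the last task".
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)

# Hypothetical 3-task example: accuracy on task 0 drops from 0.80 to 0.62.
R = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.62, 0.68, 0.77],
]
print(backward_transfer(R))  # ((0.62-0.80) + (0.68-0.75)) / 2 = -0.125

Negative BWT indicates forgetting of earlier tasks, which is exactly the behavior Figure 1(b) reports for CL-VISTA.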