Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey

Jinxuan Li, Chaolei Tan, Haoxuan Chen, Jianxin Ma, Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai

TL;DR

This survey addresses how to transfer knowledge from image-language foundation models to the video domain, focusing on frozen-feature versus modified-feature paradigms and their suitability for video-text tasks ranging from fine-grained to coarse-grained. It systematically classifies methods, analyzes their strengths and limitations, and provides an experimental panorama across temporal video grounding (TVG), video-text retrieval (VTR), video action recognition (VAR), video question answering (VideoQA), captioning, spatio-temporal video grounding (STVG), open-vocabulary video instance segmentation (OVVIS), and open-vocabulary multi-object tracking (OV-MOT). The work highlights that no single paradigm dominates: CLIP-based frozen approaches and LoRA/adapter-based fine-tuning often yield strong results, while LLM-based pipelines can excel in reasoning-driven tasks at the cost of higher complexity. It also outlines future directions toward unified multi-task transfer learning, cross-model collaboration, and advanced fusion techniques, with the aim of building scalable, generalizable video-understanding systems on top of robust image-language foundations.

Abstract

Image-Language Foundation Models (ILFMs) have demonstrated remarkable success in image-text understanding and generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, alleviates the substantial data and computational requirements of training video-language foundation models from scratch for video-text learning. This survey provides the first comprehensive review of this emerging field. It begins by summarizing the widely used ILFMs and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories, frozen features and modified features, depending on whether the original ILFM representations are preserved or modified. Building upon the task-specific nature of image-to-video transfer, this survey methodically elaborates these strategies and details their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis of how effective different image-to-video transfer learning paradigms are on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive overview, this survey aims to establish a structured roadmap for advancing video-text learning based on existing ILFMs, and to inspire future research in this rapidly evolving domain.

Paper Structure

This paper contains 39 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Comparison of two representative strategies for addressing video-text understanding problems with image-text and video-text foundation models. Pretraining video-based foundation models is much more challenging than pretraining image-based ones, because video models often contain more parameters and require more training data and computational resources.
  • Figure 2: Popular image-text foundation models developed in the community, which are widely used in image-to-video transfer learning to address downstream video-language understanding problems.
  • Figure 3: Taxonomy of Image-to-video Transfer Learning Methods. Existing methods are categorized according to the employed transfer learning mechanisms, forming two major groups: frozen features and modified features. The frozen-feature group includes approaches such as knowledge distillation, post-network tuning, and side-tuning, while the modified-feature group encompasses various fine-tuning strategies, including full fine-tuning, adapter-based tuning, LoRA, and prompt tuning. Each category is further divided into fine-grained and coarse-grained paradigms, reflecting different levels of temporal and semantic alignment between visual and textual modalities.
  • Figure 4: Methods and architectures for transferring pre-trained image-text models to the video domain via frozen features.
  • Figure 5: Methods and architectures for transferring pre-trained image-text models to the video domain via modified features.
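
The distinction between the frozen-feature and modified-feature groups named in the captions above can be illustrated with a toy sketch. It is not taken from the survey: the image encoder is reduced to a single weight matrix, temporal mean pooling stands in for a post-network module, and the LoRA update uses the standard low-rank form W' = W + scale * B A. All names here (`frozen_video_feature`, `lora_weight`, `W`, `FRAMES`) are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (n x r) by b (r x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def frozen_video_feature(w_image: Matrix, frames: List[Vector]) -> Vector:
    """Frozen features: the image encoder weights are left untouched.

    Each frame is encoded independently, and a module placed after the
    encoder (here just a temporal mean, as an assumption) aggregates
    the per-frame features into one video-level feature.
    """
    per_frame = [matvec(w_image, frame) for frame in frames]
    return [sum(column) / len(per_frame) for column in zip(*per_frame)]


def lora_weight(w_image: Matrix, b: Matrix, a: Matrix,
                scale: float = 1.0) -> Matrix:
    """Modified features: the effective weight becomes W + scale * (B @ A).

    Only the low-rank factors B and A would be trained; the returned
    matrix produces features that differ from the original encoder's.
    """
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_image, delta)]


# Toy stand-ins: a 2x2 "encoder" and a two-frame "video".
W = [[1.0, 0.0], [0.0, 2.0]]
FRAMES = [[1.0, 1.0], [3.0, 1.0]]

video_feature = frozen_video_feature(W, FRAMES)
adapted_w = lora_weight(W, b=[[1.0], [0.0]], a=[[1.0, 1.0]])
```

With B set to all zeros, `lora_weight` returns the original weights unchanged. This is why LoRA-style tuning can start from exactly the pretrained image representation and move away from it only as training proceeds.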