Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li, Chaolei Tan, Haoxuan Chen, Jianxin Ma, Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai
TL;DR
This survey addresses how to transfer knowledge from image-language foundation models to the video domain, focusing on frozen versus modified feature paradigms and their suitability for video-text tasks ranging from fine-grained to coarse-grained. It systematically classifies methods, analyzes their strengths and limitations, and provides an experimental overview across temporal video grounding (TVG), video-text retrieval (VTR), video action recognition (VAR), video question answering (VideoQA), video captioning, spatio-temporal video grounding (STVG), open-vocabulary video instance segmentation (OVVIS), and open-vocabulary multi-object tracking (OV-MOT). The work highlights that no single paradigm dominates: CLIP-based frozen approaches and LoRA/adapter-based fine-tuning often yield strong results, while LLM-based pipelines can excel in reasoning-driven tasks, albeit with higher complexity. It also outlines future directions toward unified multi-task transfer learning, cross-model collaboration, and advanced fusion techniques, with the aim of building scalable, generalizable video-understanding systems on top of robust image-language foundations.
Abstract
Image-Language Foundation Models (ILFM) have demonstrated remarkable success in image-text understanding and generation tasks, providing transferable multimodal representations that generalize across diverse downstream image-based tasks. The advancement of video-text research has spurred growing interest in extending image-based models to the video domain. This paradigm, known as image-to-video transfer learning, alleviates the substantial data and computational requirements of training video-language foundation models from scratch. This survey provides the first comprehensive review of this emerging field. We begin by summarizing the widely used ILFM and their capabilities. We then systematically classify existing image-to-video transfer learning strategies into two categories, frozen features and modified features, depending on whether the original representations from ILFM are preserved or modified. Because image-to-video transfer is task-specific, we elaborate on these strategies and detail their applications across a spectrum of video-text learning tasks, ranging from fine-grained (e.g., spatio-temporal video grounding) to coarse-grained (e.g., video question answering). We further present a detailed experimental analysis of the efficacy of different image-to-video transfer learning paradigms on a range of downstream video understanding tasks. Finally, we identify prevailing challenges and highlight promising directions for future research. By offering a comprehensive and structured overview, this survey aims to establish a roadmap for advancing video-text learning based on existing ILFM and to inspire future research in this rapidly evolving domain.
