Foundation Models for Video Understanding: A Survey
Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund
TL;DR
This survey addresses the need for a unified view of Video Foundation Models (ViFMs) by categorizing them into image-based models, video-based models, and universal foundation models (UFMs) that combine multiple modalities, and by reviewing them across 14 video tasks. It highlights that image-based ViFMs often outperform dedicated video models on many tasks, while UFMs yield the strongest performance on video understanding, particularly in retrieval and cross-modal tasks. The work synthesizes pretraining strategies, architectures, loss functions, and datasets, offering actionable insights on inflation techniques (post-pretraining, adapters, prompts) and guidance for future research in scalable, ethical, and efficient ViFMs. Overall, the survey provides a roadmap for developing robust, generalizable video understanding systems that leverage large-scale multimodal pretraining and flexible deployment strategies.
Abstract
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. They do so by leveraging large-scale datasets and powerful models to capture robust and generic features from video data. This survey analyzes over 200 video foundation models and offers a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks grouped into 3 main categories. Additionally, we offer an in-depth performance analysis of these models on the 6 most common video tasks. We categorize ViFMs into three groups: 1) image-based ViFMs, which adapt existing image models for video tasks, 2) video-based ViFMs, which use video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, text, etc.) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers insights into their strengths and weaknesses to guide future advances in video understanding. Surprisingly, our analysis reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at: https://github.com/NeeluMadan/ViFM_Survey.git
