Foundation Models for Video Understanding: A Survey

Neelu Madan, Andreas Moegelmose, Rajat Modi, Yogesh S. Rawat, Thomas B. Moeslund

TL;DR

This survey addresses the need for a unified view of Video Foundation Models (ViFMs) by categorizing them into image-based, video-based, and universal multimodal models and evaluating them across 14 video tasks. It highlights that image-based ViFMs often outperform dedicated video models on many tasks, while UFMs with multiple modalities yield the strongest performance on video understanding, particularly in retrieval and cross-modal tasks. The work synthesizes pretraining strategies, architectures, loss functions, and datasets, offering actionable insights on inflation techniques (post-pretraining, adapters, prompts) and guidance for future research in scalable, ethical, and efficient ViFMs. Overall, the survey provides a roadmap for developing robust, generalizable video understanding systems that leverage large-scale multimodal pretraining and flexible deployment strategies.

Abstract

Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundational models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks grouped into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We categorize ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundational Models (UFMs), which combine multiple modalities (image, video, audio, text, etc.) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers insights into their strengths and weaknesses, guiding future advancements in video understanding. Surprisingly, our analysis reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at: https://github.com/NeeluMadan/ViFM_Survey.git

Paper Structure (54 sections, 8 figures, 9 tables)

This paper contains 54 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview of recent research trends in video understanding. The left bar chart shows a significant increase in publications on this topic, based on data from prestigious conferences and journals. The center pie chart shows the share of research focusing on generative, discriminative, and hybrid pretraining objectives. The right pie chart highlights specific pretraining objectives: Masked Data Modeling (MDM), Masked Language Modeling (MLM), Vision-Text Contrastive (VTC), Vision-Audio Contrastive (VAC), Vision-Text Matching (VTM), Vision-Text Alignment (VTA), Captioning Loss (CAP), and Distillation Loss (Distill). Best viewed in color.
  • Figure 2: Contrast between classical approaches (separate feature extraction and model training) and deep learning approaches (unified framework) in computer vision. The figure also shows the progression of deep learning approaches for both image and video processing over time. Best viewed in color.
  • Figure 3: Different video tasks. Tasks in the first column, (a) Video Action Recognition, (b) Temporal Action Localization (TAL), and (c) Spatio-temporal Action Localization (STAL), require only video understanding. Tasks in the second and third columns, (d) Video-Text Retrieval, (e) VideoQA, and (f) Video Captioning, require both video and language understanding. Best viewed in color.
  • Figure 4: Architectures adopted by Video Foundation Models (ViFMs). Uni-modal ViFMs usually follow (a) an Encoder-Decoder network, while multi-modal foundation models follow either (b) a Joint-Encoder, (c) a Dual-Encoder (n = 2) or Multi-Encoder (n > 2), or (d) a Mix-Encoder. Best viewed in color.
  • Figure 5: Classification categories after level 1 (Type): a) Adapting Image Models: Post-pretraining (3.1), Adapters (3.2), and Prompt-tuning (3.3); b) Direct Video Models: Generative (4.1), Discriminative (4.2), and Hybrid (4.3); c) Joint Image-Video Models: Generative (5.1), Discriminative (5.2), and Hybrid (5.3); d) Generative (6.1) and Conversational (6.2).
  • ...and 3 more figures