VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, Limin Wang

TL;DR

VideoEval introduces VidTAB and VidEB, a vision-centric, low-cost benchmark suite for evaluating video foundation models (VFMs) beyond traditional action-recognition tasks. By combining few-shot task adaptation (VidTAB) with embedding-based evaluation (VidEB), it assesses both the generalization and the representation power of VFMs across diverse domains. A large-scale study of 20 open-source VFMs reveals limited cross-task generalization, nuanced effects of data scale and pre-training paradigm, and benefits from combining pre-training strategies, including image-to-video adaptation. The benchmark offers practical guidance for evaluating VFMs, highlighting efficient adaptation methods and the conditions under which particular pre-training approaches transfer best to downstream tasks.

Abstract

With the growth of high-quality data and advances in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating remarkable performance on traditional video understanding benchmarks. However, existing benchmarks (e.g., Kinetics) and their evaluation protocols are often limited by relatively poor diversity, high evaluation costs, and saturated performance metrics. In this paper, we build a comprehensive benchmark suite, VideoEval, to address these issues. Specifically, we establish the Video Task Adaption Benchmark (VidTAB) and the Video Embedding Benchmark (VidEB) from two perspectives: evaluating the task adaptability of VFMs under few-shot conditions, and assessing their representation power by directly applying them to downstream tasks. With VideoEval, we conduct a large-scale study on 20 popular open-source vision foundation models. Our study reveals several findings on VFMs: 1) overall, current VFMs exhibit weak generalization across diverse tasks, 2) increasing video data, whether labeled data or weakly-labeled video-text pairs, does not necessarily improve task performance, 3) the effectiveness of some pre-training paradigms may not be fully validated by previous benchmarks, and 4) combining different pre-training paradigms can help improve generalization capabilities. We believe this study serves as an important complement to current VFM evaluation and offers valuable insights for future research.
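
To make the embedding-based perspective concrete, the following is a minimal sketch of scoring frozen embeddings with no task-specific training. It is an illustrative assumption, not the VidEB protocol: the cosine-similarity k-nearest-neighbor rule, the function names, and the toy two-dimensional vectors are all hypothetical stand-ins for embeddings produced by a frozen VFM.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, gallery, gallery_labels, k=1):
    """Label a query by majority vote over its k most similar gallery embeddings."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    votes = Counter(gallery_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(queries, query_labels, gallery, gallery_labels, k=1):
    """Fraction of queries whose predicted label matches the true label."""
    correct = sum(
        knn_predict(q, gallery, gallery_labels, k) == y
        for q, y in zip(queries, query_labels)
    )
    return correct / len(queries)


# Toy stand-ins for embeddings from a frozen model (hypothetical values).
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["a", "a", "b", "b"]
queries = [[0.8, 0.2], [0.2, 0.8]]
query_labels = ["a", "b"]

accuracy = knn_accuracy(queries, query_labels, gallery, gallery_labels, k=1)
```

Because the backbone is never updated, the only cost in a scheme like this is one forward pass per video plus a similarity search, which is what makes embedding-based evaluation cheap compared with finetuning.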

Paper Structure (33 sections, 1 equation, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of VideoEval. We propose a novel, vision-centric evaluation method for video foundation models that is comprehensive, challenging, indicative, and low-cost.
  • Figure 2: Illustration of building VideoEval. We build VideoEval through the following steps: (1) conducting task selection by considering our expected capabilities for VFMs, (2) performing data filtration from the perspectives of quality, difficulty, and diversity, (3) standardizing the task format through task construction, and (4) defining evaluation methods and targets.
  • Figure 3: Performance comparison on different training data scales. We evaluate how the performance of multiple video foundation models varies across tasks from two different domains as the scale of the training data changes. 'FT' and 'AP' denote full finetuning and attentive probe, respectively.
  • Figure 4: Illustration of different adaptation methods: (a) Full Finetuning, (b) Adapter, (c) Attentive Probe, and (d) Linear Probe.
  • Figure 5: Examples of VidTAB. We present video examples for each task in VidTAB, demonstrating that successfully completing these tasks requires VFMs to possess strong generalization capabilities.
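
Two of the adaptation methods named in Figure 4, the linear probe and the attentive probe, differ mainly in how frozen backbone tokens are pooled before classification. The sketch below shows only the forward pass and rests on simplifying assumptions: a single attention head, no key or value projections, and hypothetical function names. It is not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Linear-probe style pooling: average the frozen tokens."""
    count = len(tokens)
    return [sum(t[d] for t in tokens) / count for d in range(len(tokens[0]))]


def attentive_pool(tokens, query):
    """Attentive-probe style pooling: a learnable query attends over frozen tokens.

    Simplification (assumption): one head, tokens serve as both keys and values.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, t)) / math.sqrt(dim) for t in tokens
    ]
    weights = softmax(scores)
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


def linear_head(feature, weight, bias):
    """Trainable classifier applied to the pooled feature: logits = W x + b."""
    return [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weight, bias)
    ]


# Toy frozen tokens for one video (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0]]
uniform = mean_pool(tokens)                   # every token weighted equally
focused = attentive_pool(tokens, [10.0, 0.0])  # query favors the first token
logits = linear_head(focused, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In both cases the backbone stays frozen; the linear probe trains only the head, while the attentive probe additionally trains the query, letting it weight informative tokens more heavily at a small extra parameter cost.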