VideoGLUE: Video General Understanding Evaluation of Foundation Models

Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong

TL;DR

This work reveals the need and tremendous opportunities for research on video-focused FMs, and confirms that both tasks and adaptation methods matter when evaluating FMs.

Abstract

We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, the video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released under: https://github.com/tensorflow/models/tree/master/official/projects/videoglue.

Paper Structure (34 sections, 3 equations, 5 figures, 14 tables)

Figures (5)

  • Figure 1: Performance of FMs with end-to-end finetuning (red) and frozen backbone (blue), in comparison with state-of-the-art task-specialized models (black) on VideoGLUE benchmarks. VC-A, VC-M, and VC-ML stand for appearance-focused, motion-focused, and multi-label Video Classification tasks, respectively; TAL stands for Temporal Action Localization; STAL stands for Spatiotemporal Action Localization. The highest and lowest performance numbers on each dataset are mapped to $0.9$ and $0.1$, and the other numbers are linearly scaled accordingly on the radar chart. We also use gray shades to mark tasks that focus more on appearance understanding than on motion. We observe that: (1) FMs generally fall behind task-specialized models; (2) FMs that are trained with video data are generally better than image-native FMs on motion-focused tasks under the frozen backbone setting, and image-native FMs can generally catch up when finetuned end-to-end on the target dataset.
  • Figure 2: We study four adaptation methods to apply a foundation model (FM) to video understanding downstream tasks: (a) end-to-end finetuning, (b) frozen backbone, (c) frozen backbone with multi-layer attention pooler (MLAP), and (d) a low-rank adapter.
  • Figure 3: (a) We measure the training (red diamond) and inference (blue square) costs of different adaptation methods in terms of the number of trainable parameters and inference FLOPs, respectively. (b) We report the VideoGLUE Score, which combines an FM's performance weighted by its training costs under different adaptation methods, for all the image-native (red circle) and video-native (blue pentagon) models.
  • Figure 4: (a) Single-layer pooler head and (b) multi-layer attention pooling head for video classification and spatiotemporal action localization.
  • Figure 5: The adapter used in the vision transformer. In the adapter layer, only the down-sample layer, up-sample layer, and the scaling factor are tunable. Between the down-sample layer and up-sample layer, an activation function is applied, which in our case is ReLU.
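
The radar-chart scaling described in the Figure 1 caption (lowest score on a dataset mapped to 0.1, highest to 0.9, everything else scaled linearly in between) can be sketched in a few lines of Python. The function name `scale_for_radar`, the example scores, and the handling of ties are assumptions made for illustration; they are not taken from the paper or its released code.

```python
def scale_for_radar(scores, low=0.1, high=0.9):
    """Linearly rescale scores so min(scores) -> low and max(scores) -> high."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: the caption does not say what happens when all
        # scores on a dataset are equal; place them at the midpoint.
        return [(low + high) / 2.0 for _ in scores]
    span = (high - low) / (hi - lo)
    return [low + (s - lo) * span for s in scores]


# Made-up accuracy numbers for three models on one dataset.
example_scores = [60.0, 70.0, 80.0]
scaled = scale_for_radar(example_scores)
```

Because the scaling is applied per dataset, positions on different radar axes are only comparable as relative rankings, not as absolute accuracy.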
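
The adapter layer described in the Figure 5 caption (a down-sample layer, a ReLU, an up-sample layer, and a tunable scaling factor) might look roughly like the pure-Python sketch below. Several details are assumptions not stated in the caption: the residual addition of the adapter output to its input, the absence of bias terms, the zero initialization of the up-sample weights, and all names (`Adapter`, `matvec`, `num_trainable`). It is meant only to make the data flow concrete, not to reproduce the released implementation.

```python
import random


def matvec(weights, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


class Adapter:
    """Bottleneck adapter: down-sample -> ReLU -> up-sample -> scale."""

    def __init__(self, dim, bottleneck, scale=1.0, seed=0):
        rng = random.Random(seed)
        # Down-sample weights, shape (bottleneck x dim).
        # Small Gaussian init is an assumption.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-sample weights, shape (dim x bottleneck).
        # Zero init is an assumption; it makes the adapter start as a no-op.
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        # Tunable scaling factor applied to the adapter output.
        self.scale = scale

    def num_trainable(self):
        """Count tunable values: both weight matrices plus the scale."""
        dim, bottleneck = len(self.up), len(self.down)
        return bottleneck * dim + dim * bottleneck + 1

    def __call__(self, x):
        hidden = relu(matvec(self.down, x))
        update = matvec(self.up, hidden)
        # Assumption: the scaled update is added back to the input.
        return [xi + self.scale * ui for xi, ui in zip(x, update)]


adapter = Adapter(dim=4, bottleneck=2)
token = [1.0, -2.0, 0.5, 3.0]
output = adapter(token)
```

With a bottleneck much smaller than the token dimension, the number of tunable values grows as roughly twice the product of the two sizes, which is why this kind of adapter adds few trainable parameters relative to the frozen backbone.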