DeVAn: Dense Video Annotation for Video-Language Models

Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo Huang, Ran He, Hongxia Yang

TL;DR

DeVAn introduces a dense, long-form video annotation dataset aimed at evaluating video-language models on both short captions and multi-sentence summaries grounded in audio-visual content. The dataset comprises 8.5K YouTube clips with 5 human-generated captions and 5 summaries per clip, accompanied by a training set of 100K ASR-rich segments. The authors compare multiple model families—frozen-LLM adapters and an end-to-end VideoCoCa-based model with an ASR encoder—across generation and retrieval tasks, and demonstrate that model-based metrics like BLEURT better reflect human preferences for long-form summaries, while audio content improves performance. The work provides detailed data collection, evaluation protocols, and baselines, offering a practical benchmark for advancing dense video understanding in the era of large language models.

Abstract

We present a novel human-annotated dataset, termed DeVAn (Dense Video Annotation), for evaluating the ability of visual-language models to generate both short and long descriptions of real-world video clips. The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests. Each video clip is independently annotated by 5 human annotators, producing both captions (1 sentence) and summaries (3-10 sentences). Given any video selected from the dataset and its corresponding ASR information, we evaluate visual-language models on either caption or summary generation that is grounded in both the visual and auditory content of the video. Models are also evaluated on caption- and summary-based retrieval tasks, where the summary-based retrieval task requires identifying a target video given excerpts of a summary. Given the novel nature of the paragraph-length video summarization task, we compared existing evaluation metrics on their alignment with human preferences and found that model-based evaluation metrics provide more semantically oriented and human-aligned evaluation. Finally, we benchmarked a wide range of current video-language models on DeVAn, and we aim for DeVAn to serve as a useful evaluation set in the age of large language models and complex multi-modal tasks. Code is available at https://github.com/TK-21st/DeVAn.
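The summary-based retrieval task described above can be illustrated with a small sketch: draw one sentence from a multi-sentence summary as the text query, then check whether the target video ranks among the top results. Everything here is an assumption made for illustration, not the authors' evaluation code: the names `sample_excerpt` and `recall_at_k` are hypothetical, the regex sentence splitter is naive, the similarity scores are made up, and Recall@k is a common retrieval metric that this section does not confirm as the one reported.

```python
import random
import re


def sample_excerpt(summary: str, rng: random.Random) -> str:
    """Pick one sentence of a summary to serve as a retrieval query."""
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    return rng.choice(sentences)


def recall_at_k(scores, targets, k):
    """Fraction of queries whose target video is among the k best-scoring videos.

    scores[i][j] is the similarity between query i and video j;
    targets[i] is the index of the correct video for query i.
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Rank video indices by descending similarity and keep the first k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy usage with an invented summary and invented similarity scores.
rng = random.Random(0)
summary = "A chef chops onions. She explains the recipe aloud. The dish is plated."
query = sample_excerpt(summary, rng)

scores = [
    [0.9, 0.1, 0.0],  # query 0 vs. videos 0..2
    [0.2, 0.3, 0.5],  # query 1
    [0.1, 0.8, 0.4],  # query 2
]
targets = [0, 1, 2]
r_at_1 = recall_at_k(scores, targets, k=1)
r_at_2 = recall_at_k(scores, targets, k=2)
```

In a real evaluation the score matrix would come from a model's text and video embeddings; the sketch only shows how a single-sentence excerpt turns a summary into a retrieval query and how ranking quality could be scored.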

Paper Structure (32 sections, 24 figures, 6 tables)

Figures (24)

  • Figure 1: Example of DeVAn dataset. For each video, 5 captions and 5 summaries are independently annotated based on both visual and auditory information of the selected videos. Text-to-video retrievals from video summaries are performed by randomly sampling single-sentence excerpts.
  • Figure 2: Diversity of DeVAn dataset. Our dataset contains English videos covering a diverse range of topics uploaded across the past 17 years.
  • Figure 3: Diversity of training dataset. Our training dataset contains captions and summaries for 100K ASR-rich video segments. Note that, unlike the test dataset in Figure 2, the ASR Number of Words normalized by video duration does not have a significant concentration around 0, indicating that all videos in the training dataset contain a significant amount of ASR information.
  • Figure 4: Distribution of Number of Words in Captions and Summaries. Note that for legibility, only distributions for models without audio signals are shown. However, we found that the distributions of caption and summary lengths do not vary significantly with the introduction of audio signals.
  • Figure 5: End-to-End Model Architecture. Our End-to-End model combines VideoCoCa model architecture with an additional ASR encoder. Frame-level embeddings of VideoCoCa and ASR embeddings are concatenated before passing through the Attention Pooler. Note that the contrastive loss is computed using the first output embedding of the Attention Pooler.
  • ...and 19 more figures
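
The Figure 5 caption describes concatenating frame-level and ASR embeddings before an Attention Pooler and taking the pooler's first output embedding for the contrastive loss. A minimal sketch of that data flow follows, under several assumptions that the caption does not confirm: the concatenation is along the token axis, the pooler is reduced to single-head dot-product attention without learned projections, and the query vectors and embeddings are toy values. The function name `attention_pool` is hypothetical.

```python
import math


def attention_pool(queries, tokens):
    """Single-head dot-product attention: each query attends over all tokens."""
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        # Scaled dot-product logits between this query and every token.
        logits = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        # Numerically stable softmax over the tokens.
        peak = max(logits)
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        # Weighted average of the token vectors.
        pooled.append(
            [sum(w / total * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


# Toy inputs (assumptions): two frame embeddings and one ASR embedding, dimension 2.
frame_embeddings = [[1.0, 0.0], [0.0, 1.0]]
asr_embeddings = [[1.0, 1.0]]

# Assumed concatenation along the token axis, giving three tokens in total.
tokens = frame_embeddings + asr_embeddings

# Stand-ins for the pooler's learned queries; real values would be trained.
queries = [[0.0, 0.0], [1.0, 0.0]]

pooled = attention_pool(queries, tokens)

# Per the caption, the first pooled output is the one used for the contrastive loss.
contrastive_embedding = pooled[0]
```

Because the first toy query is all zeros, its attention weights are uniform and `contrastive_embedding` is simply the mean of the three tokens. The sketch does not model how the remaining pooled outputs are consumed.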