
A Benchmarking Methodology to Assess Open-Source Video Large Language Models in Automatic Captioning of News Videos

David Miranda Paredes, Jose M. Saavedra, Marcelo Pizarro

Abstract

News videos are among the most prevalent content types produced by television stations and online streaming platforms, yet generating textual descriptions to facilitate indexing and retrieval largely remains a manual process. Video Large Language Models (VidLLMs) offer significant potential to automate this task, but a comprehensive evaluation in the news domain is still lacking. This work presents a comparative study of eight state-of-the-art open-source VidLLMs for automatic news video captioning, evaluated on two complementary benchmark datasets: a Chilean TV news corpus (approximately 1,345 clips) and a BBC News corpus (9,838 clips). We employ lexical metrics (METEOR, ROUGE-L), semantic metrics (BERTScore, CLIPScore, Text Similarity, Mean Reciprocal Rank), and two novel fidelity metrics proposed in this work: the Thematic Fidelity Score (TFS) and Entity Fidelity Score (EFS). Our analysis reveals that standard metrics exhibit limited discriminative power for news video captioning due to surface-form dependence, static-frame insensitivity, and function-word inflation. TFS and EFS address these gaps by directly assessing thematic structure preservation and named-entity coverage in the generated captions. Results show that Gemma 3 achieves the highest overall performance across both datasets and most evaluation dimensions, with Qwen-VL as a consistent runner-up.
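
To make the evaluation concrete, the following is a minimal sketch of how the lexical and ranking metrics named above (METEOR, ROUGE-L, Mean Reciprocal Rank) can be computed with standard Python packages (nltk, rouge-score). The caption strings and rank list are hypothetical, and this is not the authors' exact evaluation pipeline; BERTScore and CLIPScore would be computed analogously with their respective packages.

```python
# Minimal sketch of the lexical/ranking metrics used in the study.
# Illustrative only; not the paper's exact evaluation code.
# Assumes: pip install nltk rouge-score
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

reference = "President announces new economic measures in Santiago"  # hypothetical ground truth
candidate = "The president presented economic measures in Santiago"  # hypothetical model caption

# METEOR: unigram matches with stemming/synonymy plus a fragmentation
# penalty. Recent NLTK versions expect pre-tokenized input.
meteor = meteor_score([reference.lower().split()], candidate.lower().split())

# ROUGE-L: F-measure over the longest common subsequence.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
    reference, candidate
)["rougeL"].fmeasure

# Mean Reciprocal Rank: average of 1/rank of the correct video per query.
def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

print(f"METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}  "
      f"MRR={mean_reciprocal_rank([1, 3, 2]):.3f}")  # hypothetical ranks
```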

Figures (10)

  • Figure 1: General pipeline of modern video captioning systems based on vision-language models (VLMs). The process begins with the raw video input, which undergoes temporal frame sampling (uniform, adaptive, or keyframe-based). The selected frames are processed by a vision encoder (typically ViT, SigLIP, EVA-CLIP, or similar architectures) to extract visual features. These features are then mapped into the language model's embedding space via a cross-modal projector (usually an MLP adapter). Finally, the multimodal LLM (such as Qwen2-VL, LLaVA-series, InternVL, among others) integrates the visual tokens with the textual prompt and autoregressively generates the natural language caption. (A minimal sketch of the frame-sampling stage follows this list.)
  • Figure 2: A sample of video content from the ChTv dataset. ChTv was downloaded from https://www.13.cl/.
  • Figure 3: Distribution of video clip durations and their description lengths (number of words) within the ChTv Dataset.
  • Figure 4: Distribution of 'Thematic Descriptors' in the ChTv Dataset.
  • Figure 5: A sample of video content from the BBC dataset. BBC was downloaded from https://www.bbc.com/.
  • ...and 5 more figures
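
To illustrate the first stage of the Figure 1 pipeline, here is a minimal sketch of uniform temporal frame sampling using OpenCV. The video path, frame count, and function name are assumptions for illustration; real VidLLMs may use adaptive or keyframe-based sampling and ship their own preprocessing.

```python
# Minimal sketch of uniform temporal frame sampling (Figure 1, first stage).
# Illustrative only. Assumes: pip install opencv-python numpy
import cv2
import numpy as np

def sample_frames_uniform(video_path, num_frames=8):
    """Return num_frames RGB frames evenly spaced across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole clip.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if ok:
            # OpenCV decodes to BGR; vision encoders expect RGB.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Hypothetical usage: the sampled frames would be handed to the VidLLM's
# image processor together with the captioning prompt.
frames = sample_frames_uniform("news_clip.mp4", num_frames=8)
```

From there, the frames would be encoded by the vision backbone and projected into the LLM's embedding space, as the Figure 1 caption describes.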