VEU-Bench: Towards Comprehensive Understanding of Video Editing

Bozheng Li, Yongliang Wu, Yi Lu, Jiashuo Yu, Licheng Tang, Jiawang Cao, Wenqing Zhu, Yuyang Sun, Jay Wu, Wenbo Zhu

TL;DR

Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset, outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models such as GPT-4o.

Abstract

Widely shared videos on the internet are often edited. Although Video Large Language Models (Vid-LLMs) have recently made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To support automatic annotation of VEU tasks, we build an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, we find that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks.

Paper Structure

This paper contains 27 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The overview of our proposed VEU-Bench. VEU-Bench covers 10 editing dimensions, evaluating models on tasks ranging from recognition to reasoning and judging, providing a robust evaluation of video editing understanding across various aspects and levels of difficulty.
  • Figure 2: The performance of 11 Vid-LLMs and the proposed expert model, Oscars, on VEU-Bench. We normalize the results per dimension for clearer comparisons.
  • Figure 3: The statistics of our proposed VEU-Bench.
  • Figure 4: The overview of our data annotation pipeline. (a) shows the data annotation process for reasoning and judging tasks. Based on an established knowledge base, the annotator selects the most relevant attribute or function and reformulates a video-specific answer to create the QA pair. (b) shows the evaluation mechanism: the response is matched against the corresponding abstract feature in the knowledge base and compared with the annotated answer to calculate an overall score.
  • Figure 5: Ablation results of prompt designs. We conduct experiments on Qwen2-VL-7B, VideoLLaMA2-7B and Gemini-1.5-Pro.
  • ...and 2 more figures