Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

TL;DR

EditVid-QA tackles the problem of understanding edited social media videos with large multimodal models by introducing a new VQA benchmark and rectified evaluation metrics. It builds two high-quality training data sources, EditedVideo2K and Panda-WebVid30K, to improve model generalization to editing patterns, memes, and temporal reasoning. The study shows that current open-source 7B LMMs underperform on EditVid-QA while GPT-4V excels, and that data augmentation yields consistent gains across categories. It also reveals biases in GPT-3.5-based judging and provides practical guidance for reliable evaluation and training data collection.

Abstract

Emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut the raw video and add effects/modifications before publishing it on social media platforms. Edited videos usually have high view counts, but they are not covered by existing video LMM benchmarks, e.g., ActivityNet-QA or the VideoChatGPT benchmark. In this paper, we leverage the edited videos on a popular short video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos benchmark nuanced understanding and high-level reasoning, while effect and game videos evaluate the ability to understand artificial design. Most open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a huge domain gap between edited short videos on social media and regular raw videos. To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos, which boosts performance on the proposed EditVid-QA benchmark, indicating the effectiveness of high-quality training data. We also identify a serious issue in the existing evaluation protocol using the GPT-3.5 judge, namely a "sorry" attack, where a sorry-style naive answer can achieve an extremely high rating from the GPT judge, e.g., a correctness score of over 4.3 under the VideoChatGPT evaluation protocol. To avoid "sorry" attacks, we evaluate results with a GPT-4 judge and keyword filtering. The dataset is released at https://github.com/XenonLamb/EditVid-QA.
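The keyword filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the keyword list or say how a filtered answer is scored, so everything below is an assumption made for illustration: the `SORRY_KEYWORDS` tuple, the function names `is_sorry_style` and `filtered_mean_score`, and the choice to replace the judge score of a sorry-style answer with a floor value of 0.

```python
from typing import Iterable, Sequence

# Hypothetical keyword list; the source does not specify the actual keywords.
SORRY_KEYWORDS = ("sorry", "i cannot", "i can't", "unable to")


def is_sorry_style(answer: str, keywords: Iterable[str] = SORRY_KEYWORDS) -> bool:
    """Return True if the answer contains any refusal-style keyword (case-insensitive)."""
    text = answer.lower()
    return any(keyword in text for keyword in keywords)


def filtered_mean_score(
    answers: Sequence[str],
    judge_scores: Sequence[float],
    floor: float = 0.0,
) -> float:
    """Average the judge scores, overriding sorry-style answers with `floor`.

    Assumption: a filtered answer receives the floor score instead of
    whatever the judge assigned. The source only states that keyword
    filtering is used alongside a GPT-4 judge.
    """
    if len(answers) != len(judge_scores):
        raise ValueError("answers and judge_scores must have the same length")
    if not answers:
        raise ValueError("at least one answer is required")
    adjusted = [
        floor if is_sorry_style(answer) else score
        for answer, score in zip(answers, judge_scores)
    ]
    return sum(adjusted) / len(adjusted)
```

In this sketch, a refusal such as "Sorry, I cannot see the video." no longer contributes whatever score a lenient judge assigned to it, so a model cannot raise its average by declining to answer.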

Paper Structure (12 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Example video frames and QA pairs of the proposed EditVid-QA dataset. Watermarks are removed for anonymity. Best viewed on the screen with zoom-in.
  • Figure 2: Comparison between the GPT-3.5 and GPT-4 judges on the VideoChatGPT dataset (Maaz et al., 2023). The two GPT judges can be inconsistent in some cases.
  • Figure 3: Qualitative results of our model and GPT-4V on the proposed EditVid-QA benchmark. Watermarks are removed for anonymity.