GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

Abstract

In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
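The contrast the abstract draws between a single overall accuracy and dimension-wise evaluation can be illustrated with a small sketch. This is not the paper's evaluation code. The record layout (dimension name, ground-truth label, model prediction), the function names, and the toy data are all assumptions made up for illustration.

```python
from collections import defaultdict


def dimension_wise_accuracy(records):
    """Return accuracy per dimension.

    `records` is a hypothetical list of (dimension, ground_truth, prediction)
    triples; this layout is assumed here, not taken from the benchmark.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, truth, pred in records:
        total[dimension] += 1
        correct[dimension] += int(truth == pred)
    return {d: correct[d] / total[d] for d in total}


def overall_accuracy(records):
    """Return the single coarse-grained accuracy over all records."""
    records = list(records)
    return sum(truth == pred for _, truth, pred in records) / len(records)


# Toy data, invented for illustration only (not benchmark results).
toy_records = [
    ("perceptual", 1, 1),
    ("perceptual", 0, 0),
    ("perceptual", 1, 1),
    ("optical", 1, 0),
    ("optical", 0, 0),
    ("temporal", 1, 0),
]

per_dim = dimension_wise_accuracy(toy_records)
overall = overall_accuracy(toy_records)
print(per_dim)
print(round(overall, 3))
```

In this toy case the overall score hides the fact that every error falls in the optical and temporal groups, which is the kind of imbalance a per-dimension breakdown is meant to expose.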

Paper Structure (32 sections, 26 figures, 5 tables)

Figures (26)

  • Figure 1: Overview of GenVideoLens. Existing evaluations treat AI-generated video detection as a binary classification task, providing limited insight into the capabilities of LVLMs. GenVideoLens addresses this limitation by decomposing video authenticity into 15 fine-grained dimensions across frame-level and video-level analyses, enabling diagnostic evaluation that reveals where LVLMs fail and guides future improvement.
  • Figure 2: Visualization of the 15 authenticity dimensions in GenVideoLens. The dimensions are organized into frame-level and video-level categories, each illustrated with representative visual examples from the dataset to highlight the corresponding authenticity cues.
  • Figure 3: Dataset statistics of GenVideoLens. (a) Distribution of annotated artifacts across frame-level and video-level authenticity dimensions. (b) Correlation heatmap of the 15 authenticity dimensions computed from ground-truth annotations. (c) Word cloud showing the semantic diversity of GenVideoLens.
  • Figure 4: Overview of the GenVideoLens dataset construction and annotation pipeline. The process consists of data collection, dataset filtering, and human annotation.
  • Figure 5: Visualization of the inputs for our physical-causal reasoning experiment. (a), (b) Consecutive frames showing a basketball approaching the hoop. (c) The frame-difference map reveals the motion trajectory, and (d) the optical-flow map captures directional velocity fields. Results are shown with Intern3.5VL-8B.
  • ...and 21 more figures