
LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection

Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang

TL;DR

LAVID introduces an agentic LVLM framework for diffusion-generated video detection that operates without training detectors. It automatically assembles an EK toolkit, selects tools with a combined objective/subjective score, and employs online adaptation of structured prompts to reduce hallucination and improve robustness. The approach is validated on VidForensic, a dataset of real and diffusion-generated videos, showing consistent F1 gains over baselines across multiple LVLMs, including GPT-4o with structured prompts. The work demonstrates that training-free, EK-guided reasoning in LVLMs can generalize across generators and reduce artifact misinterpretation, offering practical, scalable video forensics capabilities.

Abstract

The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. AI-generated content detection has been widely studied in the image field (e.g., deepfakes), yet the video field remains unexplored. Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection thanks to their strong reasoning and multimodal capabilities. They overcome limitations of traditional deep-learning-based methods, such as a lack of transparency and an inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement. Our insights are as follows: (1) leading LVLMs can call external tools to extract useful information that facilitates their own video detection task; (2) structuring the prompt can affect an LVLM's ability to reason about and interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structured prompt by self-rewriting. Unlike prior SOTA methods that train additional detectors, our method is fully training-free and requires only LVLM inference for detection. To facilitate our research, we also create a new benchmark, VidForensic, with high-quality videos generated by multiple video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2 to 30.2% over the top baselines on our datasets across four SOTA LVLMs.
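
The tool-selection step can be illustrated with a minimal sketch. The summary only states that tools are chosen with a combined objective/subjective score; the weighted-sum form, the `weight` and `threshold` parameters, the score ranges, and the tool names below are all assumptions made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class ToolCandidate:
    """A candidate explicit-knowledge (EK) tool. All fields are hypothetical."""
    name: str
    objective_gain: float    # assumed: measured detection improvement, scaled to [0, 1]
    subjective_score: float  # assumed: the LVLM's own preference rating, in [0, 1]


def combined_score(tool: ToolCandidate, weight: float = 0.5) -> float:
    """Blend the two signals with a weighted sum (an assumed form)."""
    return weight * tool.objective_gain + (1.0 - weight) * tool.subjective_score


def select_toolkit(candidates, threshold: float = 0.5, weight: float = 0.5):
    """Return names of tools whose combined score meets the threshold, best first."""
    scored = [(combined_score(t, weight), t.name) for t in candidates]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # Hypothetical candidates; the values are made up for the example.
    tools = [
        ToolCandidate("optical_flow", objective_gain=0.8, subjective_score=0.6),
        ToolCandidate("edge_map", objective_gain=0.2, subjective_score=0.4),
        ToolCandidate("depth_map", objective_gain=0.5, subjective_score=0.7),
    ]
    print(select_toolkit(tools))
```

Because each LVLM would supply its own subjective ratings and see different objective gains, a scheme like this would naturally yield a different toolkit per model, which matches the per-LVLM customization the paper describes.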

Paper Structure

This paper contains 41 sections, 4 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example of an AI-generated video from Kling [klingai] where LAVID makes a correct prediction with explicit knowledge enhancement. LAVID helps LVLMs detect generated videos by calling explicit knowledge tools to extract useful information from the original videos and by providing structure-formatted output.
  • Figure 2: An agentic framework (LAVID) for video detection. The left part shows our main pipeline. First, LVLMs suggest tools relevant to video detection; then, based on the model's preferences and the performance improvement each tool provides, we assemble a customized video detection toolkit for each LVLM. The right part shows the details of the online adaptation of the structured prompt. The prompt tuning is performed by the LVLM itself. Components marked with the logo are developed with an LVLM such as GPT-4o [openai2024gpt4o].
  • Figure 3: Prompt example for LVLM
  • Figure 4: Comparison between supervised learning methods and LAVID. Both SVM and XGBoost are trained with the same EK used by the LVLMs. (RAW) denotes results using raw frames only.
  • Figure 5: Heatmap of refusal rates for non-structured and structured prompts on GPT-4o across different baselines and datasets
  • ...and 2 more figures