BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Haiquan Wen; Tianxiao Li; Zhenglin Huang; Yiwei He; Guangliang Cheng

TL;DR

This work tackles the misinformation risk from AI-generated visual content by proposing a unified cross-modal detection framework, BusterX++, that jointly analyzes images and videos with natural-language explanations. It leverages direct reinforcement learning and a unified image–video training regime, paired with GenBuster++, a high-quality 4,000-sample cross-modal benchmark, to evaluate reasoning-driven detection. Experiments show that RL-based, multi-stage, unified training improves generalization and robustness across modalities and outperforms SFT-based baselines, while providing interpretable explanations assessed via user studies. Overall, the paper advances a language-centered, interpretable approach to multimodal fake content detection and establishes a standardized benchmark for unified evaluation.

Abstract

Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a framework for unified detection and explanation of synthetic images and videos, with a direct reinforcement learning (RL) post-training strategy. To enable comprehensive evaluation, we also present GenBuster++, a unified benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.
