Survey on AI-Generated Media Detection: From Non-MLLM to MLLM

Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, Ran He

TL;DR

This survey addresses AI-generated media detection with a focus on the transition from domain-specific Non-MLLM detectors to general-purpose MLLM-based detectors. It provides a structured taxonomy across single-modal and multimodal tasks—authenticity, explainability, and localization—and compares methods, datasets, and evaluation metrics while highlighting ethical and regulatory implications. The work identifies key gaps, such as explainability and localization in video and audio, and discusses hybrid approaches that combine specialized detectors with generalized MLLMs. By detailing benchmarks, policy frameworks, and future directions, the paper offers a comprehensive foundation for researchers and policymakers to advance robust, transparent, and secure GenAI detection technologies.

Abstract

The proliferation of AI-generated media poses significant challenges to information authenticity and social trust, creating strong demand for reliable detection methods. Methods for detecting AI-generated media have evolved rapidly, paralleling the advancement of Multimodal Large Language Models (MLLMs). Current detection approaches can be categorized into two main groups: Non-MLLM-based and MLLM-based methods. The former employs high-precision, domain-specific detectors powered by deep learning techniques, while the latter utilizes general-purpose detectors based on MLLMs that integrate authenticity verification, explainability, and localization capabilities. Despite significant progress in this field, the literature still lacks a comprehensive survey that examines the transition from domain-specific to general-purpose detection methods. This paper addresses this gap by providing a systematic review of both approaches, analyzing them from single-modal and multi-modal perspectives. We present a detailed comparative analysis of these categories, examining their methodological similarities and differences. Through this analysis, we explore potential hybrid approaches and identify key challenges in forgery detection, providing direction for future research. Additionally, as MLLMs become increasingly prevalent in detection tasks, ethical and security considerations have emerged as critical global concerns. We examine the regulatory landscape surrounding Generative AI (GenAI) across various jurisdictions, offering valuable insights for researchers and practitioners in this field.

Paper Structure

This paper contains 56 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Survey at a Glance. (a) Input and Methods. This constitutes the core of our work. We categorize the inputs for AI-generated media detection into five distinct modalities, with task types including authenticity detection, explainability, and localization. We conduct an in-depth review of over 100 studies, classifying them into Non-MLLM detectors and MLLM detectors. (b) Benchmarking. We classify popular and emerging benchmarks by task type (authenticity detection, explainability, and localization) and discuss them according to their modality-specific approaches. (c) Policies. We analyze and discuss the legal frameworks and scholarly debates across various countries, categorizing AI-generated media policy into initiatives, regulations, and blueprints. This section provides valuable insights for researchers in the field. (d) Future Trends. We explore how AI-generated media detection could benefit from broader modality support, advancements in MLLM detection capabilities, and improvements in legal regulations. Some images are courtesy of online resources.
  • Figure 2: Illustration of MLLM-based detection methodologies for AI-generated text.
  • Figure 3: Illustration of MLLM-based detection methodologies for AI-generated images. The "Mask + Image → Text" approach is reproduced from li2024forgerygpt, the "Text + Image → Mask" approach is reproduced from huang2024sida, and the Independent Mask Localization method is adapted from lian2024large.
  • Figure 4: Illustration of MLLM-based detection methodologies for AI-generated video and audio.
  • Figure 5: Illustration of Non-MLLM-based authenticity detection methodologies for AI-generated images. The methods are categorized into: (a) low-level, (b) high-level, (c) reconstruction error, and (d) watermarking. Panel (d) is reproduced from luo2025digital.