Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai, Haojie Cheng, Ziyan Shi, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

TL;DR

Aetheria introduces a multimodal content-safety framework built on five specialized agents that engage in grounded, adversarial debate to detect implicit risks. The architecture combines retrieval-augmented grounding, a hierarchical adjudication protocol, and a memory-based continuous learning loop to produce transparent audit logs. Empirical evaluations on AIR-Bench show superior accuracy, especially for nuanced cross-modal risks, compared with commercial and open-source baselines. The work advances trustworthy AI moderation by delivering interpretable judgments and a scalable, knowledge-grounded reasoning process suitable for high-stakes content safety scenarios.

Abstract

The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration. Employing a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge retrieval. Comprehensive experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

Paper Structure

This paper contains 42 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the Aetheria Framework Architecture. The pipeline consists of three online phases and one offline loop: (1) Context & Grounding, where multimodal inputs are standardized by the Preprocessor and grounded with historical precedents via the Supporter (RAG); (2) The Debate Arena, which facilitates an adversarial multi-round dialogue between a risk-averse Strict Debater and a context-aware Loose Debater; (3) Adjudication, where the Arbiter derives a transparent verdict using a Hierarchical Adjudication Protocol. Additionally, an offline Meta-Learning Loop continuously refines the Case Library by retrieving samples from the Log Database and extracting key cues to improve future reasoning.
  • Figure 2: Overview of the AIR-Bench Construction and Statistics. (a) Adversarial Curation Pipeline: The data undergoes a rigorous "Difficulty Screening" by 8 baseline models followed by expert arbitration. (b) Negative Skew Distribution: We intentionally introduce a negative class skew (43.33%) to penalize single-modality bias. (c) Risk Taxonomy: The benchmark covers 12 distinct risk categories including Bias, Hate, and Network Attacks.
  • Figure 3: Comparative performance trajectory across sequential batches. The experimental group demonstrates a clear upward trend as the knowledge base expands (simulating a high-density feedback loop), significantly outperforming the memory-less baseline.
  • Figure 4: Case 1: Dangerous Chemical Interaction. Aetheria detects implicit physical risks. Note the Concession in Round 2 where safety evidence overrides benign intent.
  • Figure 5: Case 2: Implicit Hate Speech. Aetheria detects dehumanization where benign visual activities are re-contextualized by hostile text.
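
The three online phases described in the Figure 1 caption can be sketched as a small Python pipeline. This is only an illustrative toy under stated assumptions, not the authors' implementation: every function and class name below is hypothetical, the debaters are keyword rules standing in for model calls, retrieval is plain word overlap standing in for RAG, and the arbiter rule is a placeholder because the Hierarchical Adjudication Protocol is not specified in this summary. The offline Meta-Learning Loop is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Stance = Tuple[str, str]  # (label, argument); label is "safe" or "unsafe"


@dataclass
class Case:
    text: str
    image_caption: str  # hypothetical stand-in for the image modality


@dataclass
class Verdict:
    label: str
    transcript: List[str] = field(default_factory=list)  # audit log of the debate


def preprocess(case: Case) -> str:
    """Phase 1a (Preprocessor): standardize multimodal input into one context string."""
    return f"TEXT: {case.text.strip()} | IMAGE: {case.image_caption.strip()}"


def retrieve_precedents(context: str, case_library: List[str], k: int = 2) -> List[str]:
    """Phase 1b (Supporter): toy word-overlap retrieval, assumed in place of real RAG."""
    words = set(context.lower().split())
    scored = [(len(words & set(p.lower().split())), p) for p in case_library]
    hits = [(s, p) for s, p in scored if s > 0]  # drop precedents with no overlap
    hits.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in hits[:k]]


def strict_debater(precedents: List[str], seen: List[str]) -> Stance:
    """Risk-averse side: flags content when a retrieved precedent is marked RISK."""
    risky = [p for p in precedents if p.startswith("RISK")]
    if risky:
        return "unsafe", f"precedent suggests risk: {risky[0]}"
    return "safe", "no risky precedent retrieved"


def loose_debater(precedents: List[str], seen: List[str]) -> Stance:
    """Context-aware side: defends a benign reading, concedes to evidence cited earlier."""
    if any("precedent suggests risk" in line for line in seen):
        return "unsafe", "concede: cited evidence outweighs benign intent"
    return "safe", "content reads as benign in context"


def arbiter(strict: Stance, loose: Stance, transcript: List[str]) -> Verdict:
    """Phase 3 placeholder: accept agreement, otherwise escalate (an assumption)."""
    label = strict[0] if strict[0] == loose[0] else "escalate"
    return Verdict(label=label, transcript=transcript)


def moderate(case: Case, case_library: List[str], rounds: int = 2) -> Verdict:
    context = preprocess(case)
    precedents = retrieve_precedents(context, case_library)
    transcript: List[str] = []
    strict: Stance = ("safe", "")
    loose: Stance = ("safe", "")
    for r in range(1, rounds + 1):
        seen = list(transcript)  # each side sees only the previous rounds
        strict = strict_debater(precedents, seen)
        loose = loose_debater(precedents, seen)
        transcript.append(f"R{r} strict [{strict[0]}]: {strict[1]}")
        transcript.append(f"R{r} loose [{loose[0]}]: {loose[1]}")
    return arbiter(strict, loose, transcript)


if __name__ == "__main__":
    library = [
        "RISK mixing bleach and ammonia releases toxic gas",
        "OK recipe for cleaning windows with vinegar",
    ]
    case = Case("How to mix bleach and ammonia for cleaning", "two bottles under a sink")
    verdict = moderate(case, library)
    print(verdict.label)
    for line in verdict.transcript:
        print(line)
```

In this toy, the loose debater holds a benign reading in round 1 and concedes in round 2 once the strict debater's precedent is on the record, loosely mirroring the concession pattern noted in the Figure 4 caption. The transcript plays the role of the audit log; a real system would replace each stub with model calls and a proper retrieval backend.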