Multi-Agent Debate: A Unified Agentic Framework for Tabular Anomaly Detection

Pinqiao Wang, Sheng Li

TL;DR

Tabular anomaly detection is challenged by distribution shift, missing data, and rare events, where no single detector is reliably dominant. The authors propose MAD, a multi-agent debating framework in which heterogeneous detectors act as agents that each emit a normalized score $\tilde{s}_i(x)\in[0,1]$, a confidence, and structured evidence, optionally reviewed by an LLM critic. A coordinator converts these messages into bounded losses via a synthesis operator $\Psi$ and updates agent influence with an exponentiated-gradient rule, yielding a final debated score $\hat{s}(x)$ and an auditable debate trace; restricting the message space recovers standard ensembles as special cases. Regret guarantees hold for the synthesized losses, and conformal calibration can wrap the debated score to control false positives under exchangeability. Empirically, MAD improves overall robustness, calibration, and slice-level robustness across diverse tabular anomaly benchmarks, with diagnostics that reveal how disagreement is resolved and when resolving it most benefits performance.
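
The coordinator loop described above can be illustrated with a minimal sketch. It assumes fixed agent scores $\tilde{s}_i(x)$ across rounds and uses a hypothetical placeholder loss (distance from the weighted consensus) in place of the synthesis operator $\Psi$, whose actual form is not given here. The names `eg_update`, `placeholder_losses`, and `debate` are illustrative and do not come from the paper.

```python
import numpy as np

def eg_update(alpha, losses, eta=0.5):
    """One exponentiated-gradient (Hedge) step; keeps alpha on the simplex."""
    w = alpha * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

def placeholder_losses(scores, alpha):
    """Hypothetical stand-in for Psi (NOT the paper's operator):
    each agent's loss is its absolute distance from the weighted consensus,
    which lies in [0, 1] because all scores lie in [0, 1]."""
    consensus = float(alpha @ scores)
    return np.clip(np.abs(scores - consensus), 0.0, 1.0)

def debate(scores, rounds=3, eta=0.5):
    """Run a few coordination rounds and return (debated score, weights, trace)."""
    scores = np.asarray(scores, dtype=float)
    alpha = np.full(len(scores), 1.0 / len(scores))  # uniform initialization
    trace = []
    for _ in range(rounds):
        losses = placeholder_losses(scores, alpha)
        alpha = eg_update(alpha, losses, eta)
        trace.append(alpha.copy())  # per-round weights form a simple audit trail
    return float(alpha @ scores), alpha, trace

# Three agents: two agree the point is anomalous, one dissents.
s_hat, alpha, trace = debate([0.9, 0.8, 0.2], rounds=3, eta=0.5)
```

With this placeholder loss the dissenting agent loses influence each round, so the debated score moves toward the majority. A real synthesis operator would presumably also use the confidence and evidence fields of each message.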

Abstract

Tabular anomaly detection is often handled by single detectors or static ensembles, even though strong performance on tabular data typically comes from heterogeneous model families (e.g., tree ensembles, deep tabular networks, and tabular foundation models) that frequently disagree under distribution shift, missingness, and rare-anomaly regimes. We propose MAD, a Multi-Agent Debating framework that treats this disagreement as a first-class signal and resolves it through a mathematically grounded coordination layer. Each agent is a machine learning (ML)-based detector that produces a normalized anomaly score, confidence, and structured evidence, augmented by a large language model (LLM)-based critic. A coordinator converts these messages into bounded per-agent losses and updates agent influence via an exponentiated-gradient rule, yielding both a final debated anomaly score and an auditable debate trace. MAD is a unified agentic framework that can recover existing approaches, such as mixture-of-experts gating and learning-with-expert-advice aggregation, by restricting the message space and synthesis operator. We establish regret guarantees for the synthesized losses and show how conformal calibration can wrap the debated score to control false positives under exchangeability. Experiments on diverse tabular anomaly benchmarks show improved robustness over baselines and clearer traces of model disagreement.
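
As a rough illustration of how conformal calibration can wrap an anomaly score, the sketch below computes a split-conformal p-value. It assumes a held-out calibration set of debated scores from normal points and that larger scores mean more anomalous; the function names and the synthetic calibration scores are hypothetical, not taken from the paper.

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    """Split-conformal p-value: share of calibration scores at least as
    extreme as the test score, with the usual +1 finite-sample correction."""
    cal = np.asarray(cal_scores, dtype=float)
    return (1 + np.sum(cal >= test_score)) / (len(cal) + 1)

def flag_anomaly(cal_scores, test_score, level=0.05):
    """Flag the point when its p-value is at most the target level.
    Under exchangeability, a normal point is flagged with probability <= level."""
    return bool(conformal_pvalue(cal_scores, test_score) <= level)

# Synthetic calibration scores standing in for debated scores on normal data.
cal = np.linspace(0.0, 0.5, 99)
p_high = conformal_pvalue(cal, 0.9)   # far above every calibration score
p_low = conformal_pvalue(cal, 0.1)    # well inside the calibration range
```

The false-positive guarantee depends only on exchangeability between calibration and test points, which is why the wrapper is agnostic to how the debated score was produced.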

Paper Structure (18 sections, 2 theorems, 38 equations, 2 figures, 9 tables, 1 algorithm)

Key Result

Theorem 3.1

Assume $\ell_i^{(t)}\in[0,1]$ for all $i,t$ and initialize $\alpha^{(1)}$ uniformly. Let $\alpha^{(t)}$ be updated by the exponentiated-gradient rule (eq:eg_prelim) with $\eta\in(0,1]$. Then
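
For orientation, the textbook Hedge guarantee under these same assumptions is commonly stated as below. This is the standard result and not necessarily the paper's exact statement, whose constants and notation may differ. The symbols $K$ (number of agents) and $T$ (number of rounds) are borrowed from the Figure 2 caption and assumed to match the theorem's.

$$
\sum_{t=1}^{T} \langle \alpha^{(t)}, \ell^{(t)} \rangle \;-\; \min_{i \in \{1,\dots,K\}} \sum_{t=1}^{T} \ell_i^{(t)} \;\le\; \frac{\ln K}{\eta} + \frac{\eta T}{8}
$$

In this standard form, tuning $\eta$ on the order of $\sqrt{\ln K / T}$ gives regret of order $\sqrt{T \ln K}$, so the average per-round gap to the best single agent vanishes as $T$ grows.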

Figures (2)

  • Figure 1: MAD design overview. The framework decomposes into four modular building blocks: Perception (from input to agent selection), Action (agent debate), Coordinator, and Output. This separates signal extraction, agent interaction, coordination logic, and decision reporting.
  • Figure 2: Results and diagnostics in one view. (a) Disagreement density by family. (b) Disagreement summary (median with 90th percentile). (c) $\Delta$ROC (dispute-aware minus mean ensemble) vs. disagreement. (d) $\Delta$PR vs. disagreement. (e) Family-level rare-event metrics (PR-AUC, Recall@1%FPR). (f) Component ablation. (g) Effect of debate rounds $T$ (low/mid/high disagreement). (h) Effect of agent pool size $K$. (i) Mean update magnitude $|\text{DA}-\text{mean}|$ vs. disagreement (MAD).

Theorems & Definitions (2)

  • Theorem 3.1: Hedge/EG regret
  • Theorem 4.1: MAD regret