MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
Xueying Ding, Simon Klüttermann, Haomin Wen, Yilong Chen, Leman Akoglu
TL;DR
MacrOData addresses the small scale and limited diversity of existing tabular OD benchmarks by introducing a large-scale, open benchmark suite composed of OddBench, OvRBench, and SynBench, totaling 2,446 datasets with standardized train/test splits and semantic metadata. The work exposes limitations of prior benchmarks (notably ADBench) and reports that foundation models achieve superior performance and practical latency advantages across real-world and synthetic data, with fair evaluation supported by public/private benchmark partitions and an online leaderboard. It also provides detailed data curation pipelines, anonymization protocols, and representative subsets to enable scalable benchmarking, meta-learning, and foundation-model research in OD. Taken together, MacrOData offers a diverse, extensible platform for more reliable progress tracking and richer insights for practitioners deploying tabular OD methods. The benchmarks, analyses, and released code and search infrastructure are geared toward standardized evaluation and reproducible, scalable experimentation.
Abstract
Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark ADBench is the de facto standard in the literature but comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvRBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: we provide standardized train/test splits for all datasets; we provide public/private benchmark partitions, with test labels for the private partition held out and reserved for an online leaderboard; and we annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods, comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings and practical guidelines, as well as individual performances as references for future research. All benchmarks, 2,446 datasets in total, are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.
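To make the role of a standardized train/test split concrete, the following is a minimal sketch of how one detector might be scored on one dataset. It is not the MacrOData evaluation code. The synthetic data, the z-score detector, the choice of AUROC as the metric, and all function names are illustrative assumptions; the sketch also assumes the training split is unlabeled and consists mostly of inliers.

```python
import random
import statistics


def fit_zscore_detector(train_rows):
    """Estimate per-feature mean and standard deviation from the training split."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def outlier_scores(model, rows):
    """Score each row by its summed squared z-scores (higher = more anomalous)."""
    means, stds = model
    return [
        sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds))
        for row in rows
    ]


def auroc(labels, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical stand-in for one benchmark dataset with a fixed split.
rng = random.Random(0)
train_rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
test_inliers = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(90)]
test_outliers = [[rng.gauss(6, 1) for _ in range(4)] for _ in range(10)]
test_rows = test_inliers + test_outliers
test_labels = [0] * len(test_inliers) + [1] * len(test_outliers)

# Fit on the training split only; labels are used solely for test-time scoring.
model = fit_zscore_detector(train_rows)
test_scores = outlier_scores(model, test_rows)
test_auroc = auroc(test_labels, test_scores)
print(f"test AUROC: {test_auroc:.3f}")
```

Because the split is fixed in advance, any two methods scored this way see exactly the same training rows and the same test rows, which is what makes their numbers comparable across papers.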
