MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

Xueying Ding, Simon Klüttermann, Haomin Wen, Yilong Chen, Leman Akoglu

TL;DR

MacrOData tackles the bottleneck of small, limited tabular OD benchmarks by introducing a large-scale, open benchmark suite composed of OddBench, OvRBench, and SynBench, totaling 2,446 datasets with standardized train/test splits and semantic metadata. The framework reveals limitations of prior benchmarks (notably ADBench) and demonstrates that foundation models deliver superior performance and practical latency advantages across real-world and synthetic data, while maintaining fair evaluation via public/private leaderboards. The work provides detailed data curation pipelines, anonymization protocols, and representative subsets to enable scalable benchmarking, meta-learning, and foundation-model research in OD. Taken together, MacrOData offers a robust, diverse, and extensible platform that promises more reliable progress tracking and richer insights for practitioners deploying tabular OD methods. The benchmarks, analyses, and released code/search infrastructure are geared toward advancing OD via standardized evaluation and reproducible, scalable experimentation.

Abstract

Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark ADBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvRBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: we provide standardized train/test splits for all datasets, split the benchmarks into public and private partitions (with the test labels of the private partition held out for an online leaderboard), and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, and individual performances as references for future research. All benchmarks, containing 2,446 datasets combined, are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.
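To make the evaluation setting concrete, the following is a minimal sketch of scoring one detector over a collection of datasets that each come with a fixed train/test split. It is not the MacrOData pipeline: the toy data generator `make_toy_dataset`, the k-nearest-neighbor distance detector, the inlier-only training split, and AUROC as the metric are all assumptions chosen for illustration, and no MacrOData loading API is used or implied.

```python
import math
import random


def knn_scores(train, test, k=5):
    """Score each test row by its mean distance to its k nearest training rows."""
    scores = []
    for x in test:
        dists = sorted(math.dist(x, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def auroc(labels, scores):
    """AUROC as P(outlier score > inlier score); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_toy_dataset(seed):
    """Hypothetical stand-in for one benchmark dataset (assumption: inlier-only train)."""
    rng = random.Random(seed)
    train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    test_in = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
    test_out = [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(5)]
    return train, test_in + test_out, [0] * 95 + [1] * 5


# Evaluate the same detector on every dataset, then aggregate across datasets.
results = []
for seed in range(10):
    train, test, labels = make_toy_dataset(seed)
    results.append(auroc(labels, knn_scores(train, test)))

mean_auroc = sum(results) / len(results)
print(f"mean AUROC over {len(results)} toy datasets: {mean_auroc:.3f}")
```

Because every method sees the same split of each dataset, per-dataset scores are directly comparable, and aggregating over more datasets gives a more stable estimate of a method's average performance.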

Paper Structure (83 sections, 11 equations, 23 figures, 17 tables)

This paper contains 83 sections, 11 equations, 23 figures, 17 tables.

Figures (23)

  • Figure 1: t-SNE embedding of datasets (based on dataset-level metafeatures; Vanschoren, 2018) across the three MacrOData benchmarks (ours) and ADBench (Han et al., 2022), showing that MacrOData offers a significant boost in both scale and diversity. (best in color)
  • Figure 2: Summary statistics of MacrOData vs. ADBench.
  • Figure 3: Overview of curation steps used to build OddBench and OvRBench.
  • Figure 4: Word clouds of metadata keywords across OddBench datasets showing (a) the semantic anomalies and (b) a broad spectrum of domains covered. The full version is shown in (a), while overly frequent keywords including "anomal", "status", "outlier", "classification", "data", "performance", "failure", "quality", "location", and "error" are dropped in (b).
  • Figure 5: Composition of OvRBench, which exhibits significant data diversity by integrating a wide array of sources.
  • ...and 18 more figures