Scaling Up Diffusion and Flow-based XGBoost Models

Jesse C. Cresswell, Taewoo Kim

TL;DR

This work scales up diffusion- and flow-based generative models for tabular data that use XGBoost as the vector-field regressor. A ground-up re-engineering of the existing implementation reduces memory usage from quadratic to linear in dataset size, enables much larger models, and adds algorithmic improvements such as multi-output trees and adaptive early stopping. The approach is validated on large-scale calorimeter data, where the model is called CaloForest, and on 27 benchmark datasets, showing improved generation quality and substantial resource savings over the prior implementation. The results offer a practical pathway to CPU-based, scalable tabular generation with strong applicability to scientific simulations.

Abstract

Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.
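
To make the phrase "XGBoost as the function approximator in flow-matching models" concrete, the sketch below builds the regression problem for a single timestep under the standard conditional flow matching setup with straight-line paths between Gaussian noise and data. It is an illustration under stated assumptions, not the authors' implementation: the function name `make_flow_matching_pairs`, the `n_noise` parameter, and the toy data are hypothetical, and only the Python standard library is used.

```python
import random

def make_flow_matching_pairs(data, t, n_noise=2, seed=0):
    """Build (input, target) regression pairs for one timestep t in [0, 1].

    data: list of rows (lists of floats).
    n_noise: noise draws per data row (hypothetical illustrative parameter).
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for x1 in data:
        for _ in range(n_noise):
            # x0 is a standard Gaussian noise sample with the same width as x1
            x0 = [rng.gauss(0.0, 1.0) for _ in x1]
            # xt lies on the straight line from noise (t=0) to data (t=1)
            xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
            # the velocity along that line is the regression target
            ut = [b - a for a, b in zip(x0, x1)]
            inputs.append(xt)
            targets.append(ut)
    return inputs, targets

# Toy table with 3 rows and 2 features
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
t = 0.5
X, Y = make_flow_matching_pairs(data, t)
print(len(X), len(Y[0]))  # 6 training rows, 2 target columns
```

A tree-based regressor such as `xgboost.XGBRegressor` could then be fit on `(X, Y)` for each timestep in a grid. Note that each target row has one column per feature, which is the setting where a single multi-output tree can replace one model per output column.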

Paper Structure (34 sections, 8 equations, 26 figures, 9 tables)

This paper contains 34 sections, 8 equations, 26 figures, 9 tables.

Figures (26)

  • Figure 1: Comparison of training time and memory usage between the original implementation and ours. The $\times$ indicates job failure, and the horizontal line indicates the maximum system memory.
  • Figure 3: Number of trees at the best iteration of the validation loss by timestep and dataset. Selected datasets from all 27 are highlighted for comparison to MO in App. \ref{app:hypers}. Early stopping after $n_{\text{ES}}=20$ rounds with no improvement prevents wasteful training where no progress is being made.
  • Figure 4: Resource usage of the ForestFlow implementation from Jolicoeur-Martineau et al. (2023), compared to our implementation (SO), including with multi-output trees (MO), and early stopping (ES). Top: Training time. Middle: Peak memory usage. Bottom: Generation time. A red cross $\times$ for memory indicates job failure, and hence corresponding points in other plots are unavailable. A horizontal line indicates the maximum system memory used for all models at 385 GiB.
  • Figure 5: Histograms of high-level features comparing generated Photons samples to the test set. Note the log scale of the y-axis for all but the first plot.
  • Figure 6: Individual showers shown as energy deposited per voxel for the Photons test dataset (left), and generated by CaloForest (right). Note the nested cylindrical geometry of voxels which is inconsistent across layers, meaning the data must be treated as tabular, rather than as images.
  • ...and 21 more figures
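
The early-stopping rule in the Figure 3 caption (stop after $n_{\text{ES}}=20$ rounds with no improvement in validation loss) can be sketched as a patience counter. This is a minimal illustration, not the paper's code: the function name `train_with_early_stopping` is hypothetical, and the list `val_losses` stands in for validation losses that would be computed after each boosting round.

```python
def train_with_early_stopping(val_losses, patience=20):
    """Return (best_round, rounds_run) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds pass without a new
    best loss. Rounds are 0-indexed; best_round is -1 if no rounds ran.
    """
    best_loss = float("inf")
    best_round = -1
    rounds_run = 0
    for i, loss in enumerate(val_losses):
        rounds_run = i + 1
        if loss < best_loss:
            # New best: remember it and reset the patience window
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop training
            break
    return best_round, rounds_run

# Loss improves for 3 rounds, then plateaus for 100 rounds
losses = [1.0, 0.5, 0.4] + [0.45] * 100
best_round, rounds_run = train_with_early_stopping(losses, patience=20)
print(best_round, rounds_run)  # 2 23
```

In this toy run only 23 of the 103 available rounds are executed, and the model would be truncated to the trees up to the best round, which is the quantity Figure 3 reports per timestep.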