BUFF: Boosted Decision Tree based Ultra-Fast Flow matching

Cheng Jiang, Sitian Qian, Huilin Qu

TL;DR

BUFF addresses the bottleneck of fast, high-dimensional tabular data simulation in high-energy physics by using gradient boosted trees instead of neural networks as the backbone of a conditional flow matching framework (flowBDT). The approach yields orders-of-magnitude speedups in training and inference on CPU while maintaining high fidelity for both high-level observables and low-level calorimeter and jet-constituent data. Conditional generation further improves correlation fidelity in unfolding tasks. Evaluations on diverse datasets and tasks (JetNet, CaloChallenge, unfolding, and Schrödinger Bridge refinement) show strong performance in end-to-end fast simulation, high-dimensional low-level generation, and conditional sampling, with applicability to downstream tasks such as anomaly detection and jet tagging. Overall, BUFF provides a scalable, CPU-friendly surrogate for rapid, multi-level collider simulation, with promising impact for HL-LHC workflows and beyond.

Abstract

Tabular data is one of the most frequently encountered data types in high energy physics. Unlike homogeneous data such as pixelated images, high-dimensional tabular data is often challenging to simulate with accurately captured correlations, even with the most advanced architectures. Building on findings that tree-based models outperform deep learning models on tasks specific to tabular data, we adopt the recent generative modeling class of conditional flow matching and employ several techniques to integrate Gradient Boosted Trees into it. Performance is evaluated on various tasks at different analysis levels using several public datasets. We demonstrate that training and inference for most high-level simulation tasks can be sped up by orders of magnitude. The application can be extended to low-level feature simulation and conditional generation with competitive performance.
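
The idea of pairing conditional flow matching with Gradient Boosted Trees can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear interpolation path between Gaussian noise and data, a fixed grid of time levels with one regressor per (time level, feature) pair, Euler integration for sampling, and scikit-learn's `GradientBoostingRegressor` as a stand-in for whichever boosted-tree library the paper uses. The function names `train_flow_bdt` and `sample_flow_bdt` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def train_flow_bdt(data, n_steps=10, n_noise=5, seed=0):
    """Fit one boosted-tree regressor per (time level, feature) pair.

    Assumption: the velocity target is x1 - x0 along the straight path
    x_t = (1 - t) * x0 + t * x1, with x0 drawn from a standard normal.
    """
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Pair every data row with several independent noise draws.
    x1 = np.tile(data, (n_noise, 1))
    x0 = rng.standard_normal(x1.shape)
    target = x1 - x0
    # Time levels 0, 1/n_steps, ..., (n_steps - 1)/n_steps.
    times = np.linspace(0.0, 1.0, n_steps, endpoint=False)
    models = []
    for t in times:
        xt = (1.0 - t) * x0 + t * x1
        per_feature = [
            GradientBoostingRegressor(
                n_estimators=50, max_depth=3, random_state=seed
            ).fit(xt, target[:, j])
            for j in range(n_features)
        ]
        models.append(per_feature)
    return times, models


def sample_flow_bdt(times, models, n_samples, n_features, seed=1):
    """Generate samples by Euler-integrating the learned velocity field."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, n_features))
    dt = 1.0 / len(times)
    for per_feature in models:
        # Predicted velocity at the current time level, one column per feature.
        velocity = np.column_stack([m.predict(x) for m in per_feature])
        x = x + dt * velocity
    return x


if __name__ == "__main__":
    # Toy stand-in for a table of two correlated observables.
    rng = np.random.default_rng(42)
    z = rng.standard_normal((1000, 2))
    toy = np.column_stack([z[:, 0] + 2.0, 0.8 * z[:, 0] + 0.6 * z[:, 1] - 1.0])
    times, models = train_flow_bdt(toy)
    generated = sample_flow_bdt(times, models, n_samples=2000, n_features=2)
    print("data mean:", toy.mean(axis=0), "generated mean:", generated.mean(axis=0))
```

Because each regressor is fitted independently, training in such a scheme parallelizes naturally over time levels and features on CPU, which is one plausible reason a tree-based backbone can be fast for tabular data.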

Paper Structure (14 sections, 7 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Histograms and ratio plots of simultaneously generated and original jet variables. Shaded: JetNet; solid line: flowBDT.
  • Figure 2: Kernel density estimation contour plots of 100k generated and original jet variables. Solid: JetNet; dotted: flowBDT.
  • Figure 3: Different calorimeter layer energies of the photon sample generated by flowBDT and Geant4.
  • Figure 4: Shower response histogram and ratio plot of the photon sample generated by flowBDT and Geant4.
  • Figure 5: Center of energy in the $\eta$ and $\phi$ directions in the second layer of the photon shower generated by flowBDT and Geant4.
  • ...and 6 more figures