LEMURS dataset: Large-scale multi-detector ElectroMagnetic Universal Representation of Showers

Peter McKeown, Piyush Raikwar, Anna Zaborowska

TL;DR

LEMURS tackles the need for scalable, cross-detector fast calorimeter simulations by providing a large-scale electromagnetic shower dataset across five detectors with diverse geometries. It introduces the Universal grid Representation to describe showers in a detector-agnostic, high-granularity 3D voxel grid, enabling transfer of fast-simulation concepts between detectors. The dataset comprises nearly 1 million EM showers per detector for training and a carefully designed 1,000-shower testing grid for physics validation, generated with Geant4 Par04 via the ddfastsim workflow and released openly in HDF5. This open resource, together with reproducible code and validation demonstrating practical utility (e.g., CaloDiT-2 pretraining), supports benchmarking, cross-detector studies, and foundation-model development in calorimetry for high-energy physics.

Abstract

We present LEMURS: an extensive dataset of simulated calorimeter showers designed to support the development and benchmarking of fast simulation methods in high-energy physics, most notably providing a step towards the development of foundation models. This new dataset is more robust than the well-established CaloChallenge dataset 2, featuring substantially greater statistics, a wider range of incident angles in the detector, and, most crucially, multiple detector geometries (including more realistic calorimeters). The dataset is provided in HDF5 format, with a file structure inspired by the CaloChallenge shower representation while also including more variables. The scale and diversity of LEMURS make it particularly suitable for the development of foundation models, and it has been used for the CaloDiT-2 model, a pre-trained model released in the community-standard simulation toolkit Geant4 (version 11.4.beta). All data and code for generation and analysis are openly accessible, facilitating reproducibility and reuse across the community.

Paper Structure

This paper contains 21 sections, 1 equation, 85 figures, 3 tables.

Figures (85)

  • Figure 1: Depiction of the demonstrator detector originating from the Par04 example. Additionally, an explanation of the coordinate system is provided. Incident showers enter the detector at different angles: $\theta$ describes the angle with respect to the beam (detector) axis, and $\phi$ defines the direction in the transverse plane of the detector. The incident direction defines the structure of the Universal grid Representation voxelisation.
  • Figure 2: The complexity of the implementation of the calorimeter layers in simulation, with different materials represented in different colours. Discontinuities between the modules of the polygon are shown in the zoom-in [Bacchetta:2019fmz].
  • Figure 3: The Open Data Detector (ODD), with the EM calorimeter depicted in red. In the transverse plane, the individual layers placed in a hexadecagonal structure can be seen. The larger violet structure shows the hadronic calorimeter.
  • Figure 4: The FCCee CLD detector, with the EM calorimeter depicted in dark green [Bacchetta:2019fmz].
  • Figure 5: The FCCee ALLEGRO detector [Mlynarikova:2025skz].
  • ...and 80 more figures
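
The coordinate system described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the beam (detector) axis is the z axis and uses the conventional spherical-coordinate relations; the function names `incident_direction` and `angles_from_direction` are hypothetical and are not taken from the LEMURS code or file format.

```python
import math


def incident_direction(theta, phi):
    """Return a unit vector for an incident particle.

    theta: angle with respect to the beam axis (assumed here to be z), in radians.
    phi:   direction in the transverse (x-y) plane, in radians.
    """
    sin_theta = math.sin(theta)
    return (
        sin_theta * math.cos(phi),  # x component
        sin_theta * math.sin(phi),  # y component
        math.cos(theta),            # z component, along the assumed beam axis
    )


def angles_from_direction(dx, dy, dz):
    """Invert incident_direction: recover (theta, phi) from a direction vector."""
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    theta = math.acos(dz / norm)  # polar angle in [0, pi]
    phi = math.atan2(dy, dx)      # azimuth in (-pi, pi]
    return theta, phi


if __name__ == "__main__":
    # A particle at theta = 90 degrees travels perpendicular to the beam axis.
    direction = incident_direction(math.pi / 2, 0.0)
    print(direction)
    print(angles_from_direction(*direction))
```

Under these assumptions, a particle with $\theta = \pi/2$ enters perpendicular to the beam axis, and $\phi$ then selects where in the transverse plane it points.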