Introduction to the Usage of Open Data from the Large Hadron Collider for Computer Scientists in the Context of Machine Learning

Timo Saala, Matthias Schott

TL;DR

This paper tackles the barrier to using LHC open data by providing clear data descriptions and a practical pipeline that converts CMS ROOT files into pandas DataFrames for computer scientists. It gives a concise primer on Standard Model (SM) particles, detectors, and common observables to enable machine learning applications. It presents a Docker-based workflow for transforming ROOT files into pandas DataFrames, plus benchmarks showing that the Feather and Parquet formats offer favorable trade-offs. The work aims to lower entry barriers for interdisciplinary collaboration and to foster rapid development of ML methods on high-energy physics data.

Abstract

Deep learning techniques have evolved rapidly in recent years, significantly impacting various scientific fields, including experimental particle physics. To effectively leverage the latest developments in computer science for particle physics, a strengthened collaboration between computer scientists and physicists is essential. As all machine learning techniques depend on the availability and comprehensibility of extensive data, clear data descriptions and commonly used data formats are prerequisites for successful collaboration. In this study, we converted open data from the Large Hadron Collider, recorded in the ROOT data format commonly used in high-energy physics, to pandas DataFrames, a well-known format in computer science. Additionally, we provide a brief introduction to the data's content and interpretation. This paper aims to serve as a starting point for future interdisciplinary collaborations between computer scientists and physicists, fostering closer ties and facilitating efficient knowledge exchange.

Paper Structure

This paper contains 40 sections, 25 equations, 27 figures, and 15 tables.

Figures (27)

  • Figure 1: Schematic illustration of a tracking detector including three vertices and several charged particles that are measured by a pixel detector.
  • Figure 2: Schematic illustration of an electromagnetic calorimeter and the energy deposits in its cells.
  • Figure 3: Feynman diagram visualizing the $q \bar{q} \rightarrow Z \rightarrow \mu^+ \mu^-$ process.
  • Figure 4: Feynman diagram visualizing the $q \bar{q} \rightarrow g \rightarrow q \bar{q}$ process.
  • Figure 5: Feynman diagram visualizing the $gg \rightarrow g \rightarrow t \bar{t} \rightarrow W^+ W^- b \bar{b} \rightarrow e^+ \nu_e b \bar{b} d \bar{u}$ process.
  • ...and 22 more figures
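
As an illustration of the kind of observable a process such as $q \bar{q} \rightarrow Z \rightarrow \mu^+ \mu^-$ (Figure 3) gives rise to, the following minimal sketch computes the invariant mass of a muon pair from transverse momentum, pseudorapidity, and azimuthal angle. It is not taken from the paper's pipeline: the function name and argument names are hypothetical, momenta are assumed to be in GeV, and the muons are treated as massless, which gives the standard approximation $m^2 = 2\, p_{T,1}\, p_{T,2}\, (\cosh \Delta\eta - \cos \Delta\phi)$.

```python
import math


def dimuon_invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two muons, neglecting the muon mass.

    Assumptions (not from the paper): pt in GeV, eta is the
    pseudorapidity, phi is the azimuthal angle in radians.
    """
    m_squared = 2.0 * pt1 * pt2 * (
        math.cosh(eta1 - eta2) - math.cos(phi1 - phi2)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m_squared, 0.0))


# Two back-to-back central muons with 45.6 GeV of transverse momentum each
# reconstruct to a mass of approximately 91.2 GeV, close to the Z boson mass.
mass = dimuon_invariant_mass(45.6, 0.0, 0.0, 45.6, 0.0, math.pi)
print(f"{mass:.1f} GeV")
```

Applied row by row to a table of muon pairs, a quantity like this becomes one additional feature column that a machine learning model can use alongside the raw kinematic variables.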