Modeling Performance of Data Collection Systems for High-Energy Physics

Wilkie Olin-Ammentorp, Xingfu Wu, Andrew A. Chien

TL;DR

This paper tackles the data-growth challenge in high-energy physics by introducing the SystemFlow model, a graph-based framework that treats inputs, processing, outputs, and communication as a flow of messages to predict system-level performance across heterogeneous hardware. By applying SystemFlow to the CMS DAQ and HL-LHC upgrade scenarios, the authors show that strategic improvements such as early-stage data reduction, GPU-accelerated HLT, and front-end smart sensing can reduce total power by about 60% and increase the amount of relevant data retrieved per energy unit from roughly 0.065 to 0.31 samples/kJ, though further advances are needed to meet Run-5 power and cost targets. The framework supports rapid, quantitative comparison of alternative architectures and technologies, guiding investments in front-end processing, accelerator use, and ML-augmented triggers. Overall, SystemFlow provides a valuable tool for planning heterogeneous, energy-aware data acquisition systems in data-intensive experiments.

Abstract

Exponential increases in scientific experimental data are outstripping the rate of progress in silicon technology. As a result, heterogeneous combinations of architectures and process or device technologies are increasingly important to meet the computing demands of future scientific experiments. However, the complexity of heterogeneous computing systems requires systematic modeling to understand performance. We present a model that addresses this need by framing key aspects of data collection pipelines and their constraints, combining them with the important technology vectors that shape alternatives, and computing metrics that allow complex alternatives to be compared. For instance, a data collection pipeline may be characterized by parameters such as sensor sampling rates, amount of data collected, and the overall relevancy of retrieved samples. Alternatives to this pipeline are enabled by hardware development vectors including advancing CMOS, GPUs, neuromorphic computing, and edge computing. By calculating metrics for each alternative such as overall F1 score, power, hardware cost, and energy expended per relevant sample, this model allows alternative data collection systems to be rigorously compared. To demonstrate this model's capability, we apply it to the CMS experiment (and planned HL-LHC upgrade) to evaluate and compare the application of novel technologies in the data acquisition system (DAQ). We demonstrate that improvements to early stages in the DAQ are highly beneficial, greatly reducing the resources required at later stages of processing (such as a 60% power reduction) and increasing the amount of relevant data retrieved from the experiment per unit energy (improving from 0.065 to 0.31 samples/kJ). However, we predict further advances will be required in order to meet overall power and cost constraints for the DAQ.
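The comparison metrics named in the abstract (F1 score and relevant samples per unit energy) can be illustrated with a minimal sketch. The class name `Outcome`, the function names, and all numbers below are hypothetical and are not taken from the paper; the F1 formula is the standard one, and the energy metric simply divides true positives by energy in kilojoules.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Classification counts over one observation window (hypothetical)."""
    true_pos: int   # relevant samples that were kept
    false_pos: int  # irrelevant samples that were kept
    false_neg: int  # relevant samples that were discarded


def f1_score(o: Outcome) -> float:
    """Standard F1 score: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * o.true_pos + o.false_pos + o.false_neg
    return 2 * o.true_pos / denom if denom else 0.0


def relevant_samples_per_kj(o: Outcome, power_watts: float, seconds: float) -> float:
    """Relevant (true positive) samples retrieved per kilojoule spent."""
    energy_kj = power_watts * seconds / 1000.0  # W * s = J, then J -> kJ
    return o.true_pos / energy_kj


if __name__ == "__main__":
    # Assumed example values, chosen only to make the arithmetic easy to follow.
    window = Outcome(true_pos=80, false_pos=20, false_neg=20)
    print(f1_score(window))
    print(relevant_samples_per_kj(window, power_watts=1000.0, seconds=10.0))
```

For scale, the reported change from 0.065 to 0.31 samples/kJ corresponds to roughly a 4.8-fold improvement (0.31 / 0.065 ≈ 4.77).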

Paper Structure (33 sections, 17 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: The areal density of transistors that can be manufactured has increased exponentially over time, a phenomenon referred to as "Moore's Law." However, sustaining this growth is increasingly challenging, and the industry has predicted a slow-down over the coming decades.
  • Figure 2: The amount of data that an experiment produces must be matched with a downstream computational system that meets or exceeds its needs. In the case of high-energy particle colliders, experimental variables such as luminosity and the resolution of detector systems influence the overall amount of data produced. Systems that analyze and classify the data require communication and processing components. The power, area, and scalability of these components are influenced by the specific technology being deployed.
  • Figure 3: In a SystemFlow model, components of a computing system are classified into inputs, processing, outputs, and communication. Each category utilizes metrics and functions to capture how it transforms information and the requirements needed to do so. Each alternative technology for sensing, communication, and processing will have a unique set of attributes. By simulating the flow of messages containing information through this system, the attributes of each component interact to estimate component and system-level costs and productivity metrics (such as processing power, communication channels, and number of true positive samples produced per second).
  • Figure 4: An illustration of the CMS DAQ system adapted into a SystemFlow model. Samples originate at sensors providing input to the computing system (left). These samples are transported via communication links to processing nodes (middle), where they may change in size and/or be discarded. These changes propagate downstream to further processing nodes, and eventually, the output (right). The metrics necessary to characterize each node and edge in the SystemFlow model are taken from public documentation.
  • Figure 5: Increasing the pileup of collisions within the detector and increasing the number of samples passed from the L1T to the HLT significantly change the productivity of the overall DAQ system. As conditions shift from Run-3 (bottom left) to Run-5 (top right), net productivity first increases as the L1T's rejection ratio is reduced. However, the energy cost of processing more samples within the HLT as pileup increases then causes productivity to drop steeply.
  • ...and 1 more figures
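
The message-flow idea described in the captions for Figures 3 and 4 can be sketched as a linear chain of processing stages, each of which keeps a fraction of its incoming samples, may change their size, and spends energy per sample it receives. This is only an illustrative sketch and not the authors' SystemFlow implementation: the `Stage` fields, the `simulate` function, and every number are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One processing node in a linear pipeline (all fields hypothetical)."""
    name: str
    keep_fraction: float      # fraction of incoming samples passed downstream
    size_factor: float        # output sample size divided by input sample size
    joules_per_sample: float  # energy spent on each incoming sample


def simulate(stages, input_rate_hz, sample_bytes):
    """Propagate a sample rate and sample size through the stages in order.

    Returns one record per stage with its power draw and the rate and
    bandwidth of the link leaving it.
    """
    rate, size, report = float(input_rate_hz), float(sample_bytes), []
    for s in stages:
        power_w = rate * s.joules_per_sample  # (samples/s) * (J/sample) = W
        rate *= s.keep_fraction               # samples discarded at this stage
        size *= s.size_factor                 # samples may shrink or grow
        report.append({
            "stage": s.name,
            "power_w": power_w,
            "out_rate_hz": rate,
            "out_bandwidth_Bps": rate * size,
        })
    return report


if __name__ == "__main__":
    # Placeholder two-level trigger; the values are not CMS figures.
    pipeline = [
        Stage("first_level", keep_fraction=0.01, size_factor=1.0, joules_per_sample=1e-3),
        Stage("second_level", keep_fraction=0.1, size_factor=0.5, joules_per_sample=2.0),
    ]
    for row in simulate(pipeline, input_rate_hz=1000.0, sample_bytes=1000.0):
        print(row)
```

Because each stage's power scales with the rate it receives, raising the rejection of an early stage lowers the cost of every stage after it, which is the qualitative effect the abstract attributes to early-stage improvements.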