The LHCb Stripping Project: Sustainable Legacy Data Processing for High-Energy Physics

Nathan Grieser, Eduardo Rodrigues, Niladri Sahoo, Shuqi Sheng, Nicole Skidmore, Mark Smith

TL;DR

This work addresses how to sustain access to LHCb Run 1 and 2 legacy data. It details the Stripping framework, its Python-based configurability, and its integration into the LHCb DPA and DaVinci ecosystems. It presents a GitLab-based campaign workflow with CI-driven development, validation, and YAML-based production management, enabling scalable, reproducible (re-)Stripping campaigns for legacy data. Key contributions include the modernization of collaboration processes, quantification of data-reduction outcomes, and a practical roadmap for future campaigns that balance legacy preservation with ongoing physics needs. The approach processes massive data volumes efficiently while preserving analysis integrity, so that historic LHCb datasets retain their scientific value.

Abstract

The LHCb Stripping project is a pivotal component of the experiment's data processing framework, designed to refine vast volumes of collision data into manageable samples for offline analysis. It ensures the re-analysis of Runs 1 and 2 legacy data, maintains the software stack, and executes (re-)Stripping campaigns. As the focus shifts toward newer data sets, the project continues to optimize infrastructure for both legacy and live data processing. This paper provides a comprehensive overview of the Stripping framework, detailing its Python-configurable architecture, integration with LHCb computing systems, and large-scale campaign management. We highlight organizational advancements such as GitLab-based workflows, continuous integration, automation, and parallelized processing, alongside computational challenges. Finally, we discuss lessons learned and outline a future roadmap to sustain efficient access to valuable physics legacy data sets for the LHCb collaboration.

Paper Structure

This paper contains 11 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The LHCb dataflow in Run 2 [CERN-LHCC-2018-007], as described in Section \ref{Stripping}. The (re-)Stripping stage serves as the last centralized production stage before offline analysis workflows.
  • Figure 2: Schematics of the legacy stack utilized for the LHCb offline data processing stage during Runs 1 and 2.
  • Figure 3: GitLab workflow for the development of the recent Stripping campaign. The workflow is compartmentalized to significantly reduce top-level review, while allowing lower-level reviews to focus more closely on the physics performance.
  • Figure 4: Share of storage space taken by the various output streams in Table \ref{table:latest_stripping_campaigns}. Values are provided as percentages out of 100.
  • Figure 5: Share of storage space taken by the various output streams in two Run-2 2018 campaigns. Values are provided as percentages out of 100.
  • ...and 1 more figures