The LHCb Stripping Project: Sustainable Legacy Data Processing for High-Energy Physics
Nathan Grieser, Eduardo Rodrigues, Niladri Sahoo, Shuqi Sheng, Nicole Skidmore, Mark Smith
TL;DR
This work addresses sustained access to LHCb Run 1 and 2 legacy data by describing the Stripping framework, its Python-based configurability, and its integration into the LHCb DPA and DaVinci ecosystems. It presents a GitLab-era campaign workflow with CI-driven development, validation, and YAML-based production management, which enables scalable, reproducible (re-)Stripping campaigns for legacy data. Key contributions include the modernization of collaboration processes, quantification of data-reduction outcomes, and a practical roadmap for future campaigns that balance legacy preservation with ongoing physics needs. The approach supports efficient, scalable processing of large data volumes while preserving analysis integrity, so that historic LHCb datasets retain their scientific value.
Abstract
The LHCb Stripping project is a pivotal component of the experiment's data processing framework, designed to refine vast volumes of collision data into manageable samples for offline analysis. It ensures that Run 1 and Run 2 legacy data can be re-analysed, maintains the software stack, and executes (re-)Stripping campaigns. As the focus shifts toward newer data sets, the project continues to optimize infrastructure for both legacy and live data processing. This paper provides a comprehensive overview of the Stripping framework, detailing its Python-configurable architecture, its integration with LHCb computing systems, and its large-scale campaign management. We highlight organizational advancements such as GitLab-based workflows, continuous integration, automation, and parallelized processing, and we describe the computational challenges encountered. Finally, we discuss lessons learned and outline a future roadmap to sustain efficient access to valuable legacy physics data sets for the LHCb collaboration.
