PASTA-4-PHT: A Pipeline for Automated Security and Technical Audits for the Personal Health Train
Sascha Welten, Karl Kindermann, Ahmet Polat, Martin Görz, Maximilian Jugl, Laurenz Neumann, Alexander Neumann, Johannes Lohmöller, Jan Pennekamp, Stefan Decker
TL;DR
The paper addresses privacy concerns in distributed healthcare analytics by proposing PASTA-4-PHT, an automated, DevSecOps-inspired security and audit pipeline for the Personal Health Train (PHT). It analyzes vulnerabilities across three aggregation states (source code, packaged images, and executable runtime) using a combination of documentation, static analysis, dependency checks, image scanning, and runtime testing; its outputs are stored in a graph database, and its reports are suitable for DPIA/GDPR workflows. Controlled and real-world evaluations demonstrate that PASTA-4-PHT detects both deliberately introduced and real vulnerabilities (e.g., arbitrary code execution, SQL injection, vulnerable libraries, secret exposure, and base-image flaws) and can guide safer Train deployments. The work enhances the security, transparency, and governance of data processing on sensitive health data, contributes to the FAIRness of research software, and provides a foundation for automated compliance documentation and future meta-analyses.
Abstract
With the introduction of data protection regulations, the need for innovative privacy-preserving approaches to process and analyse sensitive data has become apparent. One approach is the Personal Health Train (PHT), which brings analysis code to the data and conducts the data processing at the data premises. However, despite its demonstrated success in various studies, the execution of external code in sensitive environments, such as hospitals, introduces new research challenges because the interactions of the code with sensitive data are often incomprehensible and lack transparency. These interactions raise concerns about potential effects on the data and increase the risk of data breaches. To address this issue, this work discusses a PHT-aligned security and audit pipeline inspired by DevSecOps principles. The automated pipeline incorporates multiple phases that detect vulnerabilities. To thoroughly study its versatility, we evaluate this pipeline in two ways. First, we deliberately introduce vulnerabilities into a PHT. Second, we apply our pipeline to five real-world PHTs, which have been utilised in real-world studies, to audit them for potential vulnerabilities. Our evaluation demonstrates that our pipeline successfully identifies potential vulnerabilities and can be applied to real-world studies. In compliance with the requirements of the GDPR for data management, documentation, and protection, our automated approach supports researchers in their data-intensive work and reduces manual overhead. It can be used as a decision-making tool to assess and document potential vulnerabilities in code for data processing. Ultimately, our work contributes to increased security and overall transparency of data processing activities within the PHT framework.
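
To make the phased structure more concrete, the following minimal sketch shows one way checks could be grouped by the three aggregation states named in the TL;DR (source code, packaged image, executable runtime) and run in order. It is an illustration only, not the authors' implementation: the names `State`, `Finding`, `PIPELINE`, and `run_pipeline`, the dictionary layout used to describe a Train, and both toy checks are assumptions introduced here. Real static analysis and image scanning would be delegated to dedicated tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    """The three aggregation states of a Train (names are hypothetical)."""
    SOURCE = "source code"
    IMAGE = "packaged image"
    RUNTIME = "executable runtime"


@dataclass(frozen=True)
class Finding:
    state: State
    check: str
    message: str


# A check receives a hypothetical Train description and returns its findings.
Check = Callable[[dict], List[Finding]]


def static_eval_check(train: dict) -> List[Finding]:
    """Toy stand-in for static analysis: flag eval() calls in the source text."""
    lines = train.get("source", "").splitlines()
    return [
        Finding(State.SOURCE, "static-analysis", f"eval() call on line {n}")
        for n, line in enumerate(lines, start=1)
        if "eval(" in line
    ]


def base_image_check(train: dict) -> List[Finding]:
    """Toy stand-in for image scanning: flag base images on a deny list."""
    base = train.get("base_image", "")
    if base in train.get("denied_base_images", set()):
        return [Finding(State.IMAGE, "image-scan", f"base image {base} is deny-listed")]
    return []


# Assumed grouping of checks per state; runtime tests are omitted in this sketch.
PIPELINE: Dict[State, List[Check]] = {
    State.SOURCE: [static_eval_check],
    State.IMAGE: [base_image_check],
    State.RUNTIME: [],
}


def run_pipeline(train: dict) -> List[Finding]:
    """Run every registered check, state by state, and collect all findings."""
    findings: List[Finding] = []
    for state in State:  # Enum members iterate in definition order
        for check in PIPELINE[state]:
            findings.extend(check(train))
    return findings
```

Collecting findings as plain records keyed by state keeps the sketch close to the described workflow, in which results are persisted and later turned into audit reports. How the pipeline actually stores and reports them is not specified in this section.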
