PASTA-4-PHT: A Pipeline for Automated Security and Technical Audits for the Personal Health Train

Sascha Welten, Karl Kindermann, Ahmet Polat, Martin Görz, Maximilian Jugl, Laurenz Neumann, Alexander Neumann, Johannes Lohmöller, Jan Pennekamp, Stefan Decker

TL;DR

The paper addresses privacy concerns in distributed healthcare analytics by proposing PASTA-4-PHT, an automated, DevSecOps-inspired security and audit pipeline for the Personal Health Train (PHT). It analyzes vulnerabilities across three aggregation states (source code, packaged images, and executable runtime) using a combination of documentation, static analysis, dependency checks, image scanning, and runtime testing. Outputs are stored in a graph database, and the generated reports are suitable for DPIA/GDPR workflows. Controlled and real-world evaluations demonstrate the pipeline's ability to detect both deliberately introduced and real vulnerabilities (e.g., arbitrary code execution, SQL injection, vulnerable libraries, secret exposure, and base-image flaws) and to guide safer Train deployments. The work enhances the security, transparency, and governance of data processing on sensitive health data, contributes to FAIRness in research software, and provides a foundation for automated compliance documentation and future meta-analyses.

Abstract

With the introduction of data protection regulations, the need for innovative privacy-preserving approaches to process and analyse sensitive data has become apparent. One approach is the Personal Health Train (PHT), which brings analysis code to the data and conducts the data processing at the data premises. However, despite its demonstrated success in various studies, the execution of external code in sensitive environments, such as hospitals, introduces new research challenges because the interactions of the code with sensitive data are often incomprehensible and lack transparency. These interactions raise concerns about potential effects on the data and increase the risk of data breaches. To address this issue, this work discusses a PHT-aligned security and audit pipeline inspired by DevSecOps principles. The automated pipeline incorporates multiple phases that detect vulnerabilities. To thoroughly study its versatility, we evaluate this pipeline in two ways. First, we deliberately introduce vulnerabilities into a PHT. Second, we apply our pipeline to five real-world PHTs, which have been utilised in real-world studies, to audit them for potential vulnerabilities. Our evaluation demonstrates that our pipeline successfully identifies potential vulnerabilities and can be applied to real-world studies. In compliance with the requirements of the GDPR for data management, documentation, and protection, our automated approach supports researchers in their data-intensive work and reduces manual overhead. It can be used as a decision-making tool to assess and document potential vulnerabilities in code for data processing. Ultimately, our work contributes to increased security and overall transparency of data processing activities within the PHT framework.

Paper Structure

This paper contains 39 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Workflow of distributed analytics (DA) with multiple data holders such as hospitals. Analysis results are aggregated after each execution at the respective hospitals and returned to the analysis requester. If the code is not properly reviewed, it may contain vulnerabilities that could compromise the IT of the sensitive environments.
  • Figure 2: The PHT infrastructure with Container Trains. A Container Train is based on software containers and contains the analysis code, dependencies, libraries, and results. At each Station, the Train is executed, allowing access to and processing of data during the analysis execution (Steps 1–5). We assume that the analysis code, along with its dependencies and libraries (left), might contain vulnerabilities.
  • Figure 3: Train lifecycle diagram for Container Trains. First, a Train is represented in raw format. After a build step, the Train is represented by an image and managed by a Train handler (i.e. the PHT infrastructure) that orchestrates the Train images to the Stations. At a Station, the Train code needs to be executed. Hence, the image is transformed into an executable Container. Terminology partially adapted from Bonino et al. and Beyan et al. [BoninoDaSilvaSantos2022, beyanDistributedAnalyticsSensitive2020].
  • Figure 4: Formal visualisation of PASTA.
  • Figure 5: Integrating Train code into a semantic framework using an abstract syntax tree (AST). We first extract the syntactical structure of the Train code as an AST. Given that ASTs are inherently graphs, we seamlessly merge this AST graph into a broader semantic context and enrich it with metadata [weltenDams2021]. For the sake of simplicity, we adopt the terminology for tree nodes from the AST library (Tree-Sitter) and omit the description of the arcs in this figure.
  • ...and 2 more figures
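
The idea in the Figure 5 caption (turn Train code into an AST, treat the AST as a graph, and attach metadata) can be illustrated with a minimal sketch. This is not the pipeline's implementation: it uses Python's standard-library `ast` module in place of Tree-Sitter, represents the graph as a plain dictionary and edge list rather than a graph database, and the helpers `ast_to_graph` and `attach_metadata` as well as the `Train:` label are hypothetical names chosen for illustration only.

```python
import ast


def ast_to_graph(source):
    """Parse Python source and return its AST as (nodes, edges).

    nodes maps an integer id to the AST node type name;
    edges is a list of (parent_id, child_id) pairs.
    The root (Module) node always receives id 0.
    """
    tree = ast.parse(source)
    ids = {}
    nodes = {}
    # First pass: give every AST node a stable integer id.
    for node in ast.walk(tree):
        if id(node) not in ids:
            ids[id(node)] = len(ids)
            nodes[ids[id(node)]] = type(node).__name__
    # Second pass: record parent -> child arcs.
    edges = []
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((ids[id(node)], ids[id(child)]))
    return nodes, edges


def attach_metadata(nodes, edges, train_name):
    """Add a hypothetical metadata node that points at the AST root."""
    meta_id = len(nodes)
    nodes[meta_id] = "Train:" + train_name
    edges.append((meta_id, 0))
    return meta_id


if __name__ == "__main__":
    nodes, edges = ast_to_graph("x = 1\n")
    attach_metadata(nodes, edges, "example-train")
    for parent, child in edges:
        print(nodes[parent], "->", nodes[child])
```

Once the code is in this node-and-edge form, it can be loaded into a graph store and queried together with other metadata, which is the motivation the caption gives for merging the AST into a broader semantic context.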