Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI

Gaurang Sharma, Harri Polonen, Juha Pajula, Jutta Suksi, Jussi Tohka

TL;DR

The study shows that unsupervised, standard MRI preprocessing followed by simple image similarity measures can nearly perfectly link skull-stripped T1-weighted MRIs of the same individual across databases, timepoints, scanners, and protocols, even under cognitive decline. By combining affine alignment, intensity harmonization, skull-stripping, and 11 similarity metrics with KDE-based thresholding, the approach achieves near-perfect discrimination between intra- and inter-participant pairs across diverse datasets (including ADNI and SDSU-TS) and cross-protocol scenarios. The findings highlight a tangible privacy risk in shared neuroimaging data and underscore the need for rigorous data governance, consent, and risk assessments when releasing such data. The work also provides a practical, scalable framework for evaluating linkage risk, with implications for policy-making and future exploration of feature drivers behind MRI-based re-identification and extensions to other modalities.

Abstract

Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. However, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
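
As a rough illustration of "standard preprocessing followed by image similarity computation", the sketch below computes one simple similarity measure, the Pearson correlation coefficient (PCC, one of the measures named in the Figure 5 caption), between pairs of volumes. It assumes the volumes have already been affinely aligned to a common space, intensity-harmonized, and skull-stripped. The function name `masked_pcc`, the cubic mask, and the synthetic random "scans" are hypothetical stand-ins for illustration only, not the authors' implementation or data.

```python
import numpy as np

def masked_pcc(vol_a, vol_b, mask):
    """Pearson correlation between two co-registered volumes inside a brain mask."""
    a = vol_a[mask].astype(np.float64)
    b = vol_b[mask].astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

# Synthetic stand-ins (assumption): each "participant" is a fixed random
# anatomy pattern, and each "scan" adds independent acquisition noise.
rng = np.random.default_rng(0)
shape = (16, 16, 16)
mask = np.zeros(shape, dtype=bool)
mask[3:13, 3:13, 3:13] = True  # toy brain mask

anatomy_p1 = rng.normal(size=shape)
anatomy_p2 = rng.normal(size=shape)
scan_p1_t0 = anatomy_p1 + 0.3 * rng.normal(size=shape)
scan_p1_t1 = anatomy_p1 + 0.3 * rng.normal(size=shape)
scan_p2_t0 = anatomy_p2 + 0.3 * rng.normal(size=shape)

intra = masked_pcc(scan_p1_t0, scan_p1_t1, mask)  # same participant
inter = masked_pcc(scan_p1_t0, scan_p2_t0, mask)  # different participants
print(f"intra-participant PCC: {intra:.3f}")
print(f"inter-participant PCC: {inter:.3f}")
```

In this toy setting the intra-participant pair scores far higher than the inter-participant pair, which is the kind of separation a linkage method relies on. Real scans would need the full alignment and harmonization steps before such a comparison is meaningful.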

Paper Structure (20 sections, 18 equations, 5 figures, 5 tables)

This paper contains 20 sections, 18 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Brain MRI collected across sites, scanners, and timepoints, even after skull-stripping, can retain individual-specific features. When multi-source data (Left) is harmonized, it may inadvertently converge toward participant-specific patterns (Right), effectively linking records and increasing the risk of data samples matching.
  • Figure 2: Overview of the pipeline and example transformations of synthetic MRI data from the SLDM dataset. The top row depicts the full proposed pipeline, which relies on standard MRI processing steps. The middle and bottom rows show an example SLDM image, its transformed variants, and their harmonized outputs, illustrated using a coronal slice at MNI y-coordinate $-15$ mm. These examples highlight the pipeline’s ability to standardize both anatomical alignment and intensity distributions. Skull stripping was applied to restrict visualization to the brain.
  • Figure 3: Pre-evaluation on the SHCP dataset demonstrates that both intensity and anatomical harmonization were needed to separate inter-participant scan pairs from intra-participant scan pairs based on similarity measures. Intra- and inter-participant labels are based on ground truth but are not used in the computation of similarity measures.
  • Figure 4: Pre-evaluation on SLDM, after harmonization, demonstrates that all measures except NFID clearly distinguish between inter- and intra-participant clusters. Unsupervised thresholds predicted labels for each image pair, yielding an AUC of 1.000 and a high sensitivity of 0.978–0.999, except for NFID, which had an AUC of 0.82 and a sensitivity of 0.121. For visualization, we removed high-NFID outliers (10,573 negative FID values $< -0.5006$) using the interquartile range (IQR) method. Intra- and inter-participant labels are based on ground truth but are not used in the computation of similarity measures or in thresholding.
  • Figure 5: Evaluation on the multi-protocol study demonstrates robust MRI matching across ADNI protocols. For each query image from ADNI1, the corresponding image in ADNI2 was correctly matched, achieving an AUC and specificity of 1.00. Sensitivity exceeded 0.99 overall and reached 1.00 when using SSIM and PCC. Performance remained stable across different imaging protocols and acquisition intervals, despite potential cognitive decline. Intra- and inter-participant labels are based on ground truth but are not used in the computation of similarity measures or in thresholding.
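
The unsupervised thresholding mentioned in the captions above could be sketched as follows. The text here does not specify how the KDE-based threshold is derived, so this example assumes one plausible variant: fit a one-dimensional Gaussian kernel density estimate to all pairwise similarity scores and place the threshold at the density minimum between the two dominant modes. The bandwidth rule, the function names `kde_density` and `kde_valley_threshold`, and the synthetic scores are all assumptions made for illustration. Ground-truth labels are used only to evaluate the threshold afterwards, never to choose it.

```python
import numpy as np

def kde_density(samples, grid):
    """Evaluate a 1-D Gaussian KDE on grid points.

    Bandwidth uses the normal-reference rule of thumb (an assumption).
    """
    samples = np.asarray(samples, dtype=np.float64)
    n = samples.size
    bandwidth = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))

def kde_valley_threshold(scores, grid_size=512):
    """Return the score at the density minimum between the two highest KDE peaks."""
    scores = np.asarray(scores, dtype=np.float64)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    density = kde_density(scores, grid)
    interior = density[1:-1]
    # Indices of interior local maxima of the estimated density.
    peaks = np.flatnonzero((interior > density[:-2]) & (interior >= density[2:])) + 1
    if peaks.size < 2:
        raise ValueError("score density is not bimodal; no valley to threshold at")
    top_two = peaks[np.argsort(density[peaks])[-2:]]
    lo, hi = np.sort(top_two)
    return float(grid[lo + np.argmin(density[lo:hi + 1])])

# Synthetic similarity scores (assumption): many inter-participant pairs with
# low similarity and a few intra-participant pairs with high similarity.
rng = np.random.default_rng(42)
inter_scores = rng.normal(0.10, 0.05, size=950)
intra_scores = rng.normal(0.90, 0.05, size=50)
scores = np.concatenate([inter_scores, intra_scores])
is_intra = np.concatenate([np.zeros(950, dtype=bool), np.ones(50, dtype=bool)])

threshold = kde_valley_threshold(scores)  # labels are not used here
predicted_intra = scores > threshold

# Ground truth enters only at evaluation time.
sensitivity = (predicted_intra & is_intra).sum() / is_intra.sum()
specificity = (~predicted_intra & ~is_intra).sum() / (~is_intra).sum()
print(f"threshold={threshold:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

The valley-based rule only works when the score distribution is clearly bimodal, which is why the sketch raises an error otherwise. A measure whose intra- and inter-participant scores overlap heavily would yield an unreliable threshold under this approach.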