Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation

Richard J. Young

TL;DR

This work investigates whether inference-time vision token masking can prevent PHI leakage in medical document OCR. Across seven masking strategies applied to DeepSeek-OCR, the study finds a consistent 42.9% PHI reduction: masking fully suppresses long-form, spatially distributed identifiers (names, dates of birth, addresses) but fails to suppress short structured identifiers (MRNs, SSNs, email addresses, account numbers), which the authors attribute to language-model contextual inference. A simulated hybrid pipeline combining vision masking with NLP post-processing suggests up to 88.6% total PHI reduction under an 80% NLP accuracy assumption, which points to the value of defense-in-depth and to the limits of vision-only interventions. The findings establish a boundary for privacy-preserving OCR in healthcare, argue for redirecting effort toward decoder-level fine-tuning and hybrid architectures, and emphasize the need for real-data validation and multi-stakeholder collaboration to achieve robust HIPAA-compliant medical document processing.

Abstract

Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction, successfully suppressing long-form, spatially distributed identifiers (patient names, dates of birth, physical addresses at 100% effectiveness) while failing to suppress short structured identifiers (medical record numbers, social security numbers, email addresses, account numbers at 0% effectiveness). Ablation studies varying mask expansion radius (r=1,2,3) demonstrate that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference, rather than insufficient visual masking, drives structured identifier leakage. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
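
The 88.6% hybrid figure can be checked with simple arithmetic. The sketch below assumes that the NLP stage acts only on the PHI that vision masking leaves behind, and that the 80% accuracy applies uniformly to that remainder. The function name `hybrid_reduction` is illustrative and does not come from the paper.

```python
def hybrid_reduction(vision_reduction: float, nlp_accuracy: float) -> float:
    """Total PHI reduction when an NLP stage post-processes what vision masking leaves.

    Assumption: the NLP stage removes `nlp_accuracy` of the remaining PHI,
    independently of which identifiers the vision stage already removed.
    """
    remaining = 1.0 - vision_reduction          # PHI share that survives vision masking
    return vision_reduction + nlp_accuracy * remaining


total = hybrid_reduction(vision_reduction=0.429, nlp_accuracy=0.80)
print(f"{total:.1%}")  # 0.429 + 0.80 * 0.571 = 0.8858
```

Under these assumptions the total is 0.429 + 0.80 × 0.571 = 0.8858, which rounds to the reported 88.6%. As a side observation, 3/7 ≈ 42.9%, which would be consistent with equal weighting of the seven listed identifier types (three suppressed, four not); the abstract does not state how the figure is computed, so this is only a plausibility check.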

Paper Structure

This paper contains 52 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Bounding box to patch masking pipeline. PHI regions are mapped to the SAM grid, dilated, replaced with mask tokens, and passed through compression before decoding. This highlights the pre-compression intervention point.
  • Figure 2: Bounding boxes to SAM patch indices. Coordinate mapping from PDF bounding boxes to SAM patch indices with dilation radius $r$, showing tiling, index computation with rounding, dilation, and construction of the mask set $S$ for forward hooks.
  • Figure 3: DeepSeek-OCR masking hook points. Hook options span SAM blocks, compression neck, auxiliary vision encoder, projector, and decoder output; pre-compression masking targets the SAM path before fusion.
  • Figure 4: Compression leakage schematic. Overlapping convolutional receptive fields in the compression neck mix each PHI patch with its neighbors, so a single PHI region influences several compressed tokens. Masking only after compression therefore leaves residual PHI signal.
  • Figure 5: Convergence across masking variants. SAM-, compression-, and projector-level masking all plateau at 42.9%, indicating architectural limits of vision-only defenses.
  • ...and 6 more figures
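
The coordinate mapping that Figure 2 describes (bounding box, patch indices, dilation by radius $r$, mask set $S$) might look roughly like the sketch below. Several details are assumptions rather than facts from the paper: a 1024×1024 input with 16-pixel patches (a 64×64 grid), boxes already converted to image pixel coordinates with a top-left origin, floor/ceil rounding at the box edges, and square (Chebyshev) dilation. The function name `bbox_to_patch_indices` is hypothetical.

```python
import math


def bbox_to_patch_indices(bbox, image_size=(1024, 1024), patch=16, r=1):
    """Return the set S of flattened patch indices covered by `bbox`, dilated by r patches.

    bbox is (x0, y0, x1, y1) in pixels, top-left origin (assumed).
    Indices are row-major: index = row * grid_w + col.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    grid_w, grid_h = width // patch, height // patch

    # Patches touched by the box: floor the start edge, ceil the (exclusive) end edge.
    col0, row0 = math.floor(x0 / patch), math.floor(y0 / patch)
    col1, row1 = math.ceil(x1 / patch), math.ceil(y1 / patch)

    # Dilate by r patches in every direction, clamped to the grid.
    col0, row0 = max(0, col0 - r), max(0, row0 - r)
    col1, row1 = min(grid_w, col1 + r), min(grid_h, row1 + r)

    return {row * grid_w + col
            for row in range(row0, row1)
            for col in range(col0, col1)}


# A box covering exactly one patch at (row 2, col 2) grows to a 3x3 block with r=1.
S = bbox_to_patch_indices((32, 32, 48, 48), r=1)
print(len(S))
```

In a real pipeline, the resulting set would be consumed by a forward hook that overwrites the token embeddings at those positions with a mask token; that step depends on the model's internals and is not shown here.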