Vision Token Masking Alone Cannot Prevent PHI Leakage in Medical Document OCR: A Systematic Evaluation

Richard J. Young

TL;DR

This work investigates whether inference-time vision token masking can prevent PHI leakage in medical document OCR. Across seven masking strategies applied to DeepSeek-OCR, PHI reduction is a consistent 42.9%: masking redacts long-form, spatially distributed identifiers (names, dates of birth, addresses) but fails to suppress short structured identifiers (MRNs, SSNs, email addresses, account numbers) because of language-model contextual inference. A simulated hybrid pipeline that adds NLP post-processing to vision masking suggests up to 88.6% total PHI reduction under an assumed 80% NLP accuracy, which points to the potential of defense-in-depth approaches while exposing the limits of vision-only interventions. The findings mark a boundary for privacy-preserving OCR in healthcare, argue for redirecting effort toward decoder-level fine-tuning and hybrid architectures, and underline the need for real-data validation and multi-stakeholder collaboration to achieve robust HIPAA-compliant medical document processing.
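The 88.6% hybrid figure follows from simple arithmetic on the two reported numbers, and the short sketch below reproduces it. The function name `hybrid_reduction` is hypothetical, and the reading of 42.9% as 3 of 7 equally weighted PHI categories is an assumption made here for illustration (it matches the three redacted and four leaked categories listed above, but the source does not state the weighting).

```python
def hybrid_reduction(vision_reduction: float, nlp_accuracy: float) -> float:
    """Total PHI fraction removed when NLP post-processing catches
    `nlp_accuracy` of whatever vision masking left behind."""
    remaining = 1.0 - vision_reduction
    return vision_reduction + nlp_accuracy * remaining


# Assumption: 3 of 7 equally weighted categories are fully suppressed.
vision = 3 / 7
total = hybrid_reduction(vision, nlp_accuracy=0.80)

print(f"vision-only: {vision:.1%}")   # 42.9%
print(f"hybrid:      {total:.1%}")    # 88.6%
```

Under this reading, the NLP stage contributes 0.80 × 57.1% ≈ 45.7 percentage points on top of the 42.9% from masking, so the hybrid total is only as good as the assumed NLP accuracy.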

Abstract

Large vision-language models (VLMs) are increasingly deployed for optical character recognition (OCR) in healthcare settings, raising critical concerns about protected health information (PHI) exposure during document processing. This work presents the first systematic evaluation of inference-time vision token masking as a privacy-preserving mechanism for medical document OCR using DeepSeek-OCR. We introduce seven masking strategies (V3-V9) targeting different architectural layers (SAM encoder blocks, compression layers, dual vision encoders, projector fusion) and evaluate PHI reduction across HIPAA-defined categories using 100 synthetic medical billing statements (drawn from a corpus of 38,517 annotated documents) with perfect ground-truth annotations. All masking strategies converge to 42.9% PHI reduction: they suppress long-form, spatially distributed identifiers (patient names, dates of birth, and physical addresses; 100% effectiveness) but fail to suppress short structured identifiers (medical record numbers, social security numbers, email addresses, and account numbers; 0% effectiveness). Ablation studies varying the mask expansion radius (r=1,2,3) show that increased spatial coverage does not improve reduction beyond this ceiling, indicating that language model contextual inference, rather than insufficient visual masking, drives structured identifier leakage. A simulated hybrid architecture combining vision masking with NLP post-processing achieves 88.6% total PHI reduction (assuming 80% NLP accuracy on remaining identifiers). This negative result establishes boundaries for vision-only privacy interventions in VLMs, provides guidance distinguishing PHI types amenable to vision-level versus language-level redaction, and redirects future research toward decoder-level fine-tuning and hybrid defense-in-depth architectures for HIPAA-compliant medical document processing.
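To make the radius ablation concrete, the sketch below shows one plausible reading of "mask expansion radius": a boolean mask over a grid of vision tokens is grown by r cells in every direction. This is an illustrative assumption, not the paper's implementation; the function `expand_mask`, the 7×7 grid, and the square (Chebyshev) neighborhood are all hypothetical choices.

```python
def expand_mask(mask, r):
    """Return a copy of a 2D boolean token mask in which every masked
    cell also masks its neighbors within Chebyshev distance r."""
    rows, cols = len(mask), len(mask[0])
    out = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if not mask[i][j]:
                continue
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        out[ni][nj] = True
    return out


# Hypothetical 7x7 token grid with one PHI token at the center.
grid = [[False] * 7 for _ in range(7)]
grid[3][3] = True

# Number of masked tokens for each radius in the ablation.
coverage = {r: sum(map(sum, expand_mask(grid, r))) for r in (1, 2, 3)}
print(coverage)
```

In this toy setting the masked area grows quadratically with r, which is why the reported plateau is informative: covering far more of the image does not lower leakage for the short structured identifiers.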
