Preserving Privacy Without Compromising Accuracy: Machine Unlearning for Handwritten Text Recognition
Lei Kang, Xuanshuo Fu, Lluis Gomez, Alicia Fornés, Ernest Valveny, Dimosthenis Karatzas
TL;DR
This work tackles privacy in handwritten text recognition by showing that HTR models memorize writer-specific information. It proposes a two-stage approach: first prune the neurons most associated with forget-set data, then apply post-hoc machine unlearning, including the novel Writer-ID Confusion (WIC) loss, to erase writer identity while preserving OCR accuracy. Through extensive experiments on IAM and CVL, the authors demonstrate that their prune-unlearn pipeline, especially with WIC, achieves strong privacy (measured by low MIA success) with minimal degradation in recognition, outperforming several state-of-the-art MU methods. The results offer a practical, scalable path to privacy-aware HTR systems and set the stage for broader unlearning applications in document analysis.
Abstract
Handwritten Text Recognition (HTR) is crucial for document digitization, but handwritten data can contain user-identifiable features, such as unique writing styles, that pose privacy risks. Regulatory provisions such as the "right to be forgotten" require models to remove these sensitive traces without full retraining. We introduce a practical encoder-only transformer baseline as a robust reference for future HTR research. Building on this, we propose a two-stage unlearning framework for multihead transformer HTR models. Our method combines neural pruning with machine unlearning applied to a writer classification head, ensuring sensitive information is removed while preserving the recognition head. We also present Writer-ID Confusion (WIC), a method that forces the model's predictions on the forget set to follow a uniform distribution over writer identities, unlearning user-specific cues while maintaining text recognition performance. We compare WIC to Random Labeling, Fisher Forgetting, Amnesiac Unlearning, and DELETE within our prune-unlearn pipeline and find that WIC consistently achieves a better privacy-accuracy trade-off. This is the first systematic study of machine unlearning for HTR. Using metrics such as Accuracy, Character Error Rate (CER), Word Error Rate (WER), and Membership Inference Attacks (MIA) on the IAM and CVL datasets, we demonstrate that our method matches or surpasses state-of-the-art unlearning performance. These experiments show that our approach effectively safeguards privacy without compromising accuracy, opening new directions for document analysis research. Our code is publicly available at https://github.com/leitro/WIC-WriterIDConfusion-MachineUnlearning.
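To make the idea behind WIC concrete, the following is a minimal sketch of a loss that pushes writer predictions toward a uniform distribution. It is an illustration only, not the authors' implementation: the abstract states the uniform target but not the exact formulation, so the choice of KL divergence and the names `log_softmax` and `uniform_confusion_loss` are assumptions made here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of writer logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def uniform_confusion_loss(writer_logits):
    """KL(U || p): divergence from the uniform distribution U over writers
    to the predicted writer distribution p (assumed form, for illustration).

    The value is 0 when the writer head is maximally confused (p is uniform)
    and grows as the head becomes confident about one writer.
    """
    k = len(writer_logits)
    log_u = -math.log(k)
    log_p = log_softmax(writer_logits)
    return sum((1.0 / k) * (log_u - lp) for lp in log_p)


# A forget-set sample whose writer is still clearly identifiable...
confident = uniform_confusion_loss([8.0, 0.0, 0.0, 0.0])
# ...versus one for which the writer head carries no identity signal.
confused = uniform_confusion_loss([1.0, 1.0, 1.0, 1.0])
print(f"confident head: {confident:.4f}, confused head: {confused:.4f}")
```

Minimizing such a term on forget-set samples only would drive the writer head toward chance-level predictions for those writers, while the recognition head is left to its own objective; how the two are balanced in the actual method is described in the paper and code repository, not here.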
