AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework
Ya-Lun Li
TL;DR
AspirinSum tackles domain-specific privacy for LLM training by reframing de-identification as an aspect-based summarization task guided by domain expert notes. It learns expert-aligned aspect tokens via cross-attention and contrastive learning (XAlign), extracts sub-sentences related to personal sensitive aspects (PSA) using cross-attention and filtering (ASE, ARCSS), and replaces them with similar-utility sub-sentences from a pool to enforce $k$-anonymity (AKS). The framework is validated on the High-School Student's College Application (HSSCA) dataset, showing that the approach can preserve downstream task utility and document fidelity while dramatically reducing re-identification risk. These results demonstrate a practical path to publishing domain-specific, de-identified text datasets that remain useful for downstream analysis and model training. The work provides a flexible, expert-informed alternative to predefined PII lists and points to future improvements in chunking strategies and evaluation metrics for text privacy.
Abstract
Driven by the rapid advancement of Large Language Models (LLMs), the community eagerly consumes any available text data to train them. Currently, a large portion of this text is collected from the internet, which is regarded as a cheap source of training data. However, when people try to extend LLM capabilities to personal domains such as healthcare or education, the lack of public datasets makes the adaptation of LLMs in these domains much slower. Public datasets are scarce in such domains because the data usually contain personally sensitive information. To comply with privacy law, data in these domains must be de-identified before any kind of dissemination. Much research has addressed this problem for image and tabular data, but research on efficient and general de-identification methods for text data remains limited. Most existing methods rely on human annotation or a predefined category list and usually cannot be easily adapted to specific domains. The goal of this proposal is to develop a text de-identification framework that can be easily adapted to a specific domain and that leverages existing expert knowledge without further human annotation. We propose AspirinSum, an aspect-based utility-preserved de-identification summarization framework. By learning to align with the expert's aspects from existing comment data, it can efficiently summarize a personal sensitive document by extracting the sub-sentences related to personal sensitive aspects and de-identifying them by substituting sub-sentences with similar aspects. We envision that the de-identified text can then be used in data publishing, and we eventually aim to publish our de-identified dataset for downstream task use.
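The substitution step described above can be illustrated with a toy sketch. This is not the AspirinSum implementation: the function names (`bow`, `cosine`, `substitute`), the bag-of-words similarity, and the random choice among the top `k` candidates are all assumptions made for illustration, standing in for the learned aspect representations the framework would use. Choosing among `k` candidates only gestures at the $k$-anonymity idea mentioned in the TL;DR and does not by itself provide a formal guarantee.

```python
import math
import random
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (hypothetical stand-in for an aspect embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def substitute(sub_sentences, sensitive_idx, pool, k=3, seed=0):
    """Replace each flagged sub-sentence with one of its k most similar pool entries.

    sensitive_idx is assumed to come from an upstream extraction step
    that this sketch does not implement.
    """
    rng = random.Random(seed)
    out = list(sub_sentences)
    for i in sensitive_idx:
        query = bow(sub_sentences[i])
        ranked = sorted(pool, key=lambda s: cosine(query, bow(s)), reverse=True)
        out[i] = rng.choice(ranked[:k])  # any of the top-k candidates may be chosen
    return out


doc = ["I led the robotics club at Lincoln High", "and I enjoy mathematics"]
pool = [
    "I led a science club at my school",
    "I captained the debate team",
    "I organized a robotics workshop",
    "I volunteered at a hospital",
]
result = substitute(doc, sensitive_idx=[0], pool=pool, k=2)
```

In this sketch only the flagged sub-sentence is replaced, and the rest of the document is left untouched so that its utility for downstream tasks is disturbed as little as possible.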
