AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework

Ya-Lun Li

TL;DR

AspirinSum tackles domain-specific privacy for LLM training by reframing de-identification as an aspect-based summarization task guided by domain expert notes. It learns expert-aligned aspect tokens via cross-attention and contrastive learning (XAlign), extracts sub-sentences related to personal sensitive aspects (PSA) using cross-attention and filtering (ASE, ARCSS), and replaces them with similar-utility sub-sentences from a pool to enforce $k$-anonymity (AKS). The framework is validated on the High-School Student's College Application (HSSCA) dataset, showing that the approach can preserve downstream task utility and document fidelity while dramatically reducing re-identification risk. These results demonstrate a practical path to publishing domain-specific, de-identified text datasets that remain useful for downstream analysis and model training. The work provides a flexible, expert-informed alternative to predefined PII lists and points to future improvements in chunking strategies and evaluation metrics for text privacy.

Abstract

Driven by the rapid advancement of Large Language Models (LLMs), the community eagerly consumes any available text data for training. Currently, a large portion of the available text data is collected from the internet, which is regarded as a cheap source of training data. However, when people try to extend LLM capabilities to personal domains such as healthcare or education, the lack of public datasets in these domains makes the adaptation of LLMs much slower. Public datasets are scarce in such domains because the data usually contain personal sensitive information. To comply with privacy law, data in such domains must be de-identified before any kind of dissemination. Much research has addressed this problem for image and tabular data, but research on efficient and general de-identification methods for text data remains limited. Most existing methods rely on human annotation or a predefined category list and usually cannot be easily adapted to specific domains. The goal of this proposal is to develop a text de-identification framework that can be easily adapted to a specific domain and that leverages existing expert knowledge without further human annotation. We propose AspirinSum, an aspect-based utility-preserved de-identification summarization framework. By learning to align with experts' aspects from existing comment data, it can efficiently summarize a personal sensitive document by extracting the sub-sentences related to personal sensitive aspects and de-identify them by substituting sub-sentences with similar aspects. We envision that the de-identified text can then be used in data publishing, and we eventually plan to publish our de-identified dataset for downstream task use.
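
The substitution idea in the abstract can be illustrated with a small sketch. This is not the paper's AKS procedure: the function name `k_similar_substitute`, the toy three-dimensional vectors, and the use of cosine similarity are all assumptions made for illustration. The sketch only shows one loose reading of the goal, namely that a sensitive sub-sentence is replaced by one drawn from at least k similar pool entries, so the replacement cannot be traced back to a single source.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_similar_substitute(target_vec, pool, k, rng):
    """Pick a replacement among the k pool entries most similar to the target.

    pool is a list of (sub_sentence, vector) pairs. Both the pool layout and
    the similarity measure are hypothetical, not taken from the paper.
    """
    if len(pool) < k:
        raise ValueError("pool must hold at least k candidates")
    ranked = sorted(pool, key=lambda item: cosine(target_vec, item[1]), reverse=True)
    candidates = [text for text, _ in ranked[:k]]
    return rng.choice(candidates), candidates


# Toy pool: the vectors stand in for aspect-aware embeddings (assumed).
pool = [
    ("led the robotics club", [0.9, 0.1, 0.0]),
    ("captained the debate team", [0.8, 0.2, 0.1]),
    ("organized a science fair", [0.7, 0.3, 0.0]),
    ("won a regional piano contest", [0.0, 0.1, 0.9]),
]
target = [1.0, 0.0, 0.0]  # embedding of the sub-sentence to be replaced

replacement, candidates = k_similar_substitute(target, pool, k=3, rng=random.Random(0))
print(replacement)
```

Passing an explicit `random.Random` instance keeps the choice reproducible in experiments. A real system would also need to check that the chosen replacement preserves downstream utility, which this sketch does not attempt.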

Paper Structure (43 sections, 12 equations, 6 figures, 8 tables)

Introduction
Problem Overview
Dataset Acquisition and Ethical Concerns
Expert-aware Domain-specific Summarization
Removal of Individual Linkages
Dataset Publication and Downstream Task Utility
Research Questions
Related Work
Multi-Perspective Summarization
Aspect discovery
Aspect-based summarization
Privacy-Preserving Methods for Text Data
De-identification
Synthetic Data
Obfuscation
...and 28 more sections

Figures (6)

Figure 1: Conventional methods
Figure 2: Personal Sensitive Aspect
Figure 3: The Proposed Framework of AspirinSum
Figure 4: The details of the proposed XAlign model
Figure 5: The details of the proposed Aspect Relevant Common Sequence Selection
...and 1 more figures
