Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Yiming Li; Shuo Shao; Yu He; Junfeng Guo; Tianwei Zhang; Zhan Qin; Pin-Yu Chen; Michael Backes; Philip Torr; Dacheng Tao; Kui Ren

TL;DR

This work addresses the ambiguous scope of data protection in the era of generative AI by introducing a four-level taxonomy (non-usability, privacy preservation, traceability, deletability) that spans the entire AI lifecycle, from training data to model outputs. It maps concrete technical approaches to each level, surveys regulatory alignments and gaps, and discusses cross-jurisdictional and ethical implications. The paper argues that a lifecycle-spanning, governance-oriented framework is essential for trustworthy AI, enabling clearer rights, accountability, and deletion mechanisms. Overall, it provides a structured lens for developers, researchers, and regulators to harmonize data practices with evolving AI capabilities.

Abstract

The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of data protection and to enforce it rigorously. In this perspective, we propose a four-level taxonomy, comprising non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.
