Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren

TL;DR

This work addresses the ambiguous scope of data protection in the era of generative AI by introducing a four-level taxonomy (non-usability, privacy-preservation, traceability, deletability) that spans the entire AI lifecycle from training data to model outputs. It maps concrete technical approaches to each level, surveys regulatory alignments and gaps, and discusses cross-jurisdictional and ethical implications. The paper argues that a lifecycle-spanning, governance-oriented framework is essential for trustworthy AI, enabling clearer rights, accountability, and deletion mechanisms. Overall, it provides a structured lens for developers, researchers, and regulators to harmonize data practices with evolving AI capabilities.

Abstract

The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of data protection and to enforce it rigorously. In this perspective, we propose a four-level taxonomy, comprising non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Paper Structure

This paper contains 11 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Data flow across the lifecycle of a (generative) AI model. The schematic traces how different forms of data emerge and circulate from the moment raw samples are collected to the point at which a deployed model generates new content. (i) Data Collection and Curation: Samples, such as images, texts, and audio clips, are gathered and annotated; once aggregated, they form the training dataset that drives model learning and the testing dataset used for validation. (ii) Model Training: These datasets are transformed into model parameters (e.g., weights and biases), turning the well-trained model itself into a valuable, model-centric data asset. (iii) Model Inference: After deployment, users supply inputs or prompts, which may contain private or proprietary information, that the model processes to produce AI-generated content ranging from class labels to code, images, or full documents. Arrows indicate how each artefact (e.g., dataset, model parameters, prompts, and outputs) can be independently copied, released, or shared, underscoring why all of them must be considered within a comprehensive data-protection framework.
  • Figure 2: Hierarchical taxonomy of data protection in the (generative) AI era. This taxonomy comprises four distinct protection levels, each representing a trade-off between data usability and the degree of protection provided. At the most stringent level, data non-usability completely restricts the use of specific data in model training and inference, thus offering maximal protection at the cost of total data utility. The next level, data privacy-preservation, allows data use under stringent privacy safeguards, enabling some practical utility while protecting sensitive or private attributes. Moving further, data traceability permits extensive data usage but integrates methods to track data origins and modifications, supporting transparency and accountability with minimal functional interference. At the most permissive level, data deletability places no initial restriction on data usage but mandates mechanisms for fully removing data's influence from trained models post hoc, aligned with principles such as the 'right to be forgotten'. This hierarchical taxonomy helps disambiguate the scope of data-protection measures and provides a structured lens for evaluating and designing regulations that protect data in the (generative) AI era.
  • Figure 3: Design principles of techniques for each level. Level 1. Non-usability: Encryption and (fine-grained) authorization confine direct data access solely to authorized parties, while techniques such as unlearnable examples and non-transferable learning disable data exploitation in unauthorized domains by mitigating particular data features, thereby achieving non-usability indirectly; Level 2. Privacy-preservation: These techniques generally fall into two main categories: tampering-based and non-tampering-based methods. The former perturbs private portions of the data (occasionally at the cost of tampering with some non-private content), whereas the latter prevents direct access without modifying the data, preserving data utility; Level 3. Traceability: Traceability techniques either intrusively attach ownership signals (i.e., watermarks) to the original data or non-intrusively infer provenance and potential modifications by analyzing the data's intrinsic information; Level 4. Deletability: The influence of protected data (denoted by the purple circle in the sub-figure) can be removed either by excising the data and rebuilding the AI model from scratch to directly change the decision surface (marked by the black dotted line) or, more efficiently, by targeted unlearning that erases its influence on the surface without full model reconstruction, thereby ensuring data deletability.
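
The two deletability routes named in the Figure 3 caption (retraining from scratch versus targeted unlearning) can be contrasted on a toy model. The sketch below is illustrative only and is not a method from the paper: it assumes a one-dimensional least-squares fit whose parameters depend on the data only through running sums, so a sample's influence can be subtracted exactly. The class `LinearFit`, the helper `fit`, and the sample values are all hypothetical. Deep models generally do not admit such a closed-form removal, which is why unlearning for them is usually approximate.

```python
from dataclasses import dataclass


@dataclass
class LinearFit:
    """Hypothetical toy model: 1-D least squares y ~ a*x + b,
    stored only as running sums (sufficient statistics)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def learn(self, x: float, y: float) -> None:
        # Add one sample's contribution to every running sum.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def unlearn(self, x: float, y: float) -> None:
        # Targeted unlearning: subtract the same contribution again.
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def params(self) -> tuple[float, float]:
        # Closed-form slope and intercept from the running sums.
        denom = self.n * self.sxx - self.sx ** 2
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a, b


def fit(samples):
    """Train from scratch on the given (x, y) pairs."""
    model = LinearFit()
    for x, y in samples:
        model.learn(x, y)
    return model


# Made-up data; the last sample plays the role of the protected record.
data = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 20.0)]
protected = (4.0, 20.0)

model = fit(data)
before = model.params()

# Route 1: targeted unlearning on the already-trained model.
model.unlearn(*protected)
unlearned = model.params()

# Route 2: excise the sample and rebuild the model from scratch.
retrained = fit([s for s in data if s != protected]).params()

print("before deletion:", before)
print("after unlearning:", unlearned)
print("after retraining:", retrained)
```

In this toy setting both routes yield the same parameters up to floating-point rounding, while unlearning touches one sample instead of the whole dataset.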