Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity

James Jewitt; Gopi Krishnan Rajbahadur; Hao Li; Bram Adams; Ahmed E. Hassan

TL;DR

This paper defines permissive-washing as the mismatch between permissive licensing labels and the actual legal payload (license text and notices) required to enforce those licenses in AI artifacts. It introduces a three-stage pipeline (data collection, license detection, and auditing) and constructs a provenance graph of 124,278 dataset→model→application chains across 3,338 datasets, 6,664 models, and 28,516 applications to quantify payload presence and attribution. The authors report a pervasive payload gap: only 2.3% of datasets and 3.2% of models include both full license text and a rights-holder notice, with attribution rarely propagating downstream (e.g., 5.75% of model→application links). They also reveal a wider compliance-payload gap rooted in a structural lack of license files upstream (e.g., 96.5% of datasets and 93.4% of models missing LICENSE files) and heterogeneity across platforms (GitHub applications far more compliant than Hugging Face artifacts). The work offers a replication package to enable ongoing verification and emphasizes the need for durable, verifiable license documentation to reduce downstream legal risk in AI systems.
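To make the provenance-graph idea concrete, the following is a minimal sketch of how dataset→model→application chains could be enumerated and how an attribution propagation rate over links could be computed. The function names, the dictionary-based inputs, and the exact-substring matching rule are illustrative assumptions; this summary does not describe the authors' actual data structures or matching criteria.

```python
from typing import Dict, List, Optional, Tuple


def enumerate_chains(
    dataset_to_models: Dict[str, List[str]],
    model_to_apps: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """List every dataset -> model -> application chain in the graph."""
    chains = []
    for dataset, models in dataset_to_models.items():
        for model in models:
            for app in model_to_apps.get(model, []):
                chains.append((dataset, model, app))
    return chains


def propagation_rate(
    links: List[Tuple[str, str]],
    upstream_notice: Dict[str, Optional[str]],
    downstream_text: Dict[str, str],
) -> Optional[float]:
    """Share of links whose downstream artifact repeats the upstream notice.

    Only links whose upstream artifact actually has a notice are counted.
    Assumption: a notice is "preserved" if it appears verbatim downstream.
    """
    eligible = [(up, down) for up, down in links if upstream_notice.get(up)]
    if not eligible:
        return None  # nothing to propagate, so the rate is undefined
    preserved = sum(
        1 for up, down in eligible
        if upstream_notice[up] in downstream_text.get(down, "")
    )
    return preserved / len(eligible)


# Hypothetical toy graph, not data from the paper.
chains = enumerate_chains(
    {"ds1": ["m1", "m2"]},
    {"m1": ["app1", "app2"], "m2": []},
)
rate = propagation_rate(
    [("m1", "app1"), ("m1", "app2")],
    {"m1": "Copyright 2024 Example Lab"},
    {"app1": "NOTICE: Copyright 2024 Example Lab", "app2": "no notices here"},
)
```

Restricting the denominator to links whose upstream artifact has a notice mirrors the framing in the text, where propagation is measured only when upstream evidence exists.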

Abstract

Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts like models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements (include the full license text, provide a copyright notice, and preserve upstream attribution), and compliance with them remains unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon "permissive washing": labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset → model → application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that an astonishing 96.5% of datasets and 95.8% of models lack the required license text, and only 2.3% of datasets and 3.2% of models satisfy both license text and copyright requirements. Even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59% of models preserve compliant dataset notices and only 5.75% of applications preserve compliant model notices (with just 6.38% preserving any linked upstream notice). Practitioners cannot assume permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
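As an illustration of the distinction between a license label and the license payload, here is a minimal sketch of a per-artifact check. The filename pattern, the copyright regular expression, and the `PayloadAudit` and `audit_artifact` names are hypothetical. Treating any non-empty LICENSE-like file as "license text present" is a deliberately crude proxy, not the detection method used in the paper.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical patterns; real license detection is considerably more involved.
LICENSE_FILE_RE = re.compile(r"^(LICEN[SC]E|COPYING)(\.(txt|md))?$", re.IGNORECASE)
COPYRIGHT_RE = re.compile(r"copyright\s+(\(c\)|©)?\s*\d{4}", re.IGNORECASE)


@dataclass
class PayloadAudit:
    declared_license: Optional[str]  # label from metadata, e.g. "mit"
    has_license_text: bool           # a non-empty LICENSE-like file exists
    has_copyright_notice: bool       # some file carries a copyright line

    @property
    def permissive_washed(self) -> bool:
        """True when a label is declared but the payload is incomplete."""
        complete = self.has_license_text and self.has_copyright_notice
        return self.declared_license is not None and not complete


def audit_artifact(declared_license: Optional[str], files: Dict[str, str]) -> PayloadAudit:
    """Audit one artifact given a mapping of file path -> file contents."""
    license_bodies = [
        text for path, text in files.items()
        if LICENSE_FILE_RE.match(path.rsplit("/", 1)[-1])
    ]
    has_text = any(body.strip() for body in license_bodies)
    has_notice = any(COPYRIGHT_RE.search(text) for text in files.values())
    return PayloadAudit(declared_license, has_text, has_notice)


# Hypothetical examples: a label-only artifact and one with a full payload.
label_only = audit_artifact("mit", {"README.md": "license: mit"})
with_payload = audit_artifact(
    "mit",
    {"LICENSE": "MIT License\n\nCopyright (c) 2024 Example Author\n\nPermission is hereby granted..."},
)
```

Under these assumptions, `label_only` is flagged because the only evidence of the license is a metadata-style label, while `with_payload` carries both a license file and a copyright line.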

Paper Structure (35 sections, 2 figures, 10 tables)

Introduction
Related Work
Methodology
Stage 1: Data Collection
Stage 2: License Detection
Stage 3: Auditing
License Integrity Audit
License Attribution Audit
License Integrity Audit
License Attribution Audit
Compliance Payload Gap
Discussion and Limitations
Legal Perspectives on Risk Exposure from Permissive Washing
Training legality risk: Permissive washing amplifies uncertainty
Weights/derivation risk: The payload gap severs the chain of rights
...and 20 more sections

Figures (2)

Figure 1: An overview of our three-stage methodology: data collection, license detection, and compliance auditing.
Figure 2: Official Hugging Face documentation regarding license specification.
