Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity
James Jewitt, Gopi Krishnan Rajbahadur, Hao Li, Bram Adams, Ahmed E. Hassan
TL;DR
This paper defines permissive-washing as the mismatch between permissive licensing labels and the actual legal payload (license text and notices) required to enforce those licenses in AI artifacts. It introduces a three-stage pipeline—data collection, license detection, and auditing—and constructs a provenance graph of $124{,}278$ dataset→model→application chains across $3{,}338$ datasets, $6{,}664$ models, and $28{,}516$ applications to quantify payload presence and attribution. The authors report a pervasive payload gap: only $2.3 ext{ } ext{percent}$ of datasets and $3.2 ext{ } ext{percent}$ of models include both full license text and a rights-holder notice, with attribution rarely propagating downstream (e.g., $5.75 ext{ } ext{percent}$ of model→application links). They also reveal a wider compliance-payload gap rooted in a structural lack of license files upstream (e.g., $96.5 ext{ } ext{percent}$ of datasets and $93.4 ext{ } ext{percent}$ of models missing LICENSE files) and heterogeneity across platforms (GitHub applications far more compliant than Hugging Face artifacts). The work offers a replication package to enable ongoing verification and emphasizes the need for durable, verifiable license documentation to reduce downstream legal risk in AI systems.
Abstract
Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts like models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements: include the full license text, provide a copyright notice, and preserve upstream attribution, that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon ``permissive washing'': labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset $\rightarrow$ model $\rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that an astonishing 96.5\% of datasets and 95.8\% of models lack the required license text, only 2.3\% of datasets and 3.2\% of models satisfy both license text and copyright requirements, and even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59\% of models preserve compliant dataset notices and only 5.75\% of applications preserve compliant model notices (with just 6.38\% preserving any linked upstream notice). Practitioners cannot assume permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
