How do data owners say no? A case study of data consent mechanisms in web-scraped vision-language AI training datasets

Chung Peng Lee, Rachel Hong, Harry H. Jiang, Aster Plotnik, William Agnew, Jamie Morgenstern

TL;DR

This study audits data consent mechanisms in web-scraped vision-language datasets using DataComp CommonPool as a case study, revealing that data owners convey consent through multiple channels, including copyright notices, metadata, and watermarks, while site-level policies via ToS and REP often restrict scraping. The authors combine sample-level and web-domain-level analyses to quantify consent indicators across scales, finding that a substantial portion of data signals indicate restrictions, yet current AI data collection pipelines frequently fail to respect or uniformly interpret these signals. They demonstrate significant gaps in the release practice, such as the absence of page URLs and reliance on scraped URLs rather than direct assets, which complicates provenance and enforcement of consent. The paper argues for a unified data consent framework with opt-in mechanisms to improve transparency, provenance, and respect for data owners, highlighting practical implications for dataset curators, users, and policy discussions in AI training.

Abstract

The internet has become the main source of data for training modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners' wishes. Ignoring an owner's indication of consent around data usage not only raises ethical concerns but has also recently led to copyright infringement lawsuits. In this work, we aim to reveal information about data owners' consent to AI scraping and training, and study how it is expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both sample-level information, including copyright notices, watermarks, and metadata, and web-domain-level information, such as a site's Terms of Service (ToS) and Robots Exclusion Protocol (REP). We estimate that at least 122M samples in CommonPool exhibit some indication of a copyright notice, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate that 9-13% of samples from CommonPool contain watermarks (95% confidence interval), which existing watermark detection methods fail to capture with high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of current dataset curation and release practices and the need for a unified data consent framework that takes AI purposes into consideration.
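As a rough illustration of the web-domain-level signal mentioned above, the sketch below checks a Robots Exclusion Protocol file with Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, the `example.com` URLs, and the `may_fetch` helper are all hypothetical and are not taken from the paper; this is not the authors' audit pipeline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler everywhere and
# blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` is allowed to fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image_url = "https://example.com/images/cat.jpg"
    for agent in ("ExampleAIBot", "SomeOtherCrawler"):
        print(agent, may_fetch(ROBOTS_TXT, agent, image_url))
```

A check like this only covers the REP signal; a site's ToS, as the abstract notes, is a separate channel that has to be read and annotated on its own.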

Paper Structure

This paper contains 39 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The life cycle of curating, releasing, and using a web-scraped vision-language dataset (VLD). Even though the Dataset Curator initially downloads the image assets during curation, the released samples contain only the caption, the src url pointing to the image asset, and image metadata. To access the dataset, the Dataset User must download the images by following the released URLs. The red tags on each step indicate the data consent mechanisms we consider to be involved at that step.
  • Figure 2: Terms of Service annotations. The full population in each chart is all samples in the top 50 base domains of small-en. The portion is determined by the exact number of samples in each type. For License Type, "Not Applicable" indicates that the ToS from the base domain does not specify or provide any license type information. For Category, "Other" indicates that the base domain is for a very domain-specific service. For instance, 4sqi.net is delivered by Foursquare, a location-intelligence service provider.
  • Figure 3: Regular expression search patterns used to source copyright notice in samples' captions and OCR-extracted texts.
  • Figure 4: Distribution of the top 50 base domains in the small-en and medium-en splits of CommonPool. The two top-50 lists differ by only one domain: small-en has imgix.net, whereas medium-en has mzstatic.com.
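
The kind of search described in the Figure 3 caption can be sketched as follows. The pattern here is illustrative only: the paper's actual regular expressions are in its Figure 3 and are not reproduced in this listing, so `COPYRIGHT_PATTERN` and `has_copyright_notice` are hypothetical names with an assumed, simplified pattern.

```python
import re

# Assumed, simplified pattern (NOT the paper's): the copyright symbol, "(c)",
# the word "copyright", or the phrase "all rights reserved", case-insensitive.
COPYRIGHT_PATTERN = re.compile(
    r"©|\(c\)|\bcopyright\b|\ball rights reserved\b",
    re.IGNORECASE,
)


def has_copyright_notice(*texts):
    """Return True if any given text (e.g. a caption or OCR output) matches.

    Empty or missing texts (None) are skipped.
    """
    return any(COPYRIGHT_PATTERN.search(t) for t in texts if t)


if __name__ == "__main__":
    caption = "Sunset over the bay"
    ocr_text = "© 2019 Jane Doe"  # hypothetical OCR-extracted text
    print(has_copyright_notice(caption, ocr_text))
```

A pattern this loose will produce false positives (for example, "(c)" used as a list label), so a real audit would likely need stricter patterns or manual validation of matches.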