AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren F. Klein, Jesse Dodge

TL;DR

This work illuminates a range of implicit preferences in data curation: some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world.

Abstract

Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.

Paper Structure (59 sections, 13 figures, 29 tables)

Figures (13)

  • Figure 1: A paraphrased excerpt from a website's about page, with extracted social dimensions highlighted. We use self-descriptions like this one from Common Crawl, which is frequently used as LLM pretraining data, to examine the social effects of data curation filters.
  • Figure 2: Examples of about web pages' topical interests annotated with cluster centers' top three representative words, obtained using an inverse transformation of cluster centroids and overlaid on a UMAP of pages. Appendix \ref{appdx:topics} lists all 50 topical clusters.
  • Figure 3: Common continental subregions in AboutMe. The most frequent countries are the United States, United Kingdom, India, Canada, Australia, China, Germany, New Zealand, Italy, and South Africa (Appendix \ref{appdx:geo}).
  • Figure 4: Webpages' use of role-specific words sometimes amplifies model-based filters' preferences. In each filter's plot, roles are bucketed into three tiers of high, mid, and low based on their overall average filter score, where higher values correspond to being less filtered. The first column in each plot is each tier's average filter score, while the second is after subsetting roles only to pages that use more role-specific words than average. Error bars are 95% CI over roles in each tier.
  • Figure 5: Webpage removal rates for each subregion when pages at a bottom percentile are removed by model-based filters, using cutoffs motivated in §\ref{sec:who}. Quality ($\filledstar$) and langID ($\ast$) filters in columns, left to right: WikiWebBooks, OpenWeb, WikiRefs, Wiki, Wiki$_{ppl}$, fastText, CLD2, CLD3, and langdetect.
  • ...and 8 more figures
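
The per-subregion removal rates described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function name `removal_rates`, the `(subregion, score)` input format, the cutoff rule, and the toy scores are all assumptions made for illustration. It assumes, as in the Figure 4 caption, that higher filter scores correspond to being less filtered, and it removes the pages whose scores fall in the bottom percentile of all pages.

```python
from collections import defaultdict


def removal_rates(pages, percentile):
    """Return the fraction of pages removed per subregion.

    pages: list of (subregion, score) pairs; higher score = less filtered
           (hypothetical input format, not taken from the paper).
    percentile: share of lowest-scoring pages to remove, in [0, 100].
    """
    scores = sorted(score for _, score in pages)
    n_remove = int(len(scores) * percentile / 100)

    # Pages scoring strictly below the cutoff are removed.
    if n_remove >= len(scores):
        cutoff = float("inf")
    else:
        cutoff = scores[n_remove]

    total = defaultdict(int)
    removed = defaultdict(int)
    for region, score in pages:
        total[region] += 1
        if n_remove > 0 and score < cutoff:
            removed[region] += 1

    return {region: removed[region] / total[region] for region in total}


# Toy data (made up): ten pages from three hypothetical subregions.
toy_pages = [
    ("A", 0.1), ("A", 0.9), ("A", 0.7), ("A", 0.4),
    ("B", 0.2), ("B", 0.3), ("B", 0.8),
    ("C", 0.5), ("C", 0.6), ("C", 0.95),
]
rates = removal_rates(toy_pages, percentile=30)
```

A single global cutoff is used here so that subregions with systematically lower scores lose a larger share of their pages, which is the kind of disparity the caption describes. Tied scores at the cutoff are kept in this sketch, so slightly fewer pages than the nominal percentile may be removed.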