Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

Rachel Hong, William Agnew, Tadayoshi Kohno, Jamie Morgenstern

TL;DR

Image-text data filtering is itself biased and value-laden: it encodes specific notions of what counts as “high-quality” data. The findings point to a need for fundamental changes in dataset creation and filtering practices.

Abstract

As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.

Paper Structure (95 sections, 19 figures, 4 tables)

Figures (19)

  • Figure 1: CLIP-filtering pipeline of LAION and DataComp as described in \ref{sec:bkgd:pipeline}. In step 1, a series of initial data cleaning techniques is applied to CommonCrawl to form the raw dataset $D_r$ with each sample as image $x_i$ and corresponding alt-text tag $y_i$. The filtered dataset $D_f$ is obtained by step 2, which applies the pre-trained OpenAI CLIP model $\theta_f$ to each image-text pair $(x_i, y_i)$. If the cosine similarity score between embeddings $\theta_f(x_i)$ and $\theta_f(y_i)$ is above some predefined threshold $t_{filter}$, then the pair is included in the filtered dataset. From this dataset, step 3 trains the open-source CLIP model, and step 4 applies the CLIP model to various downstream tasks. We scope our investigation to step 2 within the enclosed box.
  • Figure 2: Pass rate (a) and raw dataset frequency (b) broken down by mentions of identity keywords from \ref{tab:keywords}. A higher pass rate represents a higher proportion included in the resulting CLIP-filtered dataset. While white and black terms are commonly mentioned, we inspect samples manually and find they relate overwhelmingly to clothing items. Figure (a) shows that CLIP is more likely to exclude text samples containing LGBTQ+ keywords compared to other identity keywords.
  • Figure 3: Heat map of pass rates by intersections of various demographic dimensions where a higher pass rate is a darker shade of blue. "---" indicates that the raw frequency of samples within that intersection is below 10 and therefore not reported. The "total" row and column in each map refer to the pass rate for all samples that mention one identity keyword. For instance, the total, wom[ae]n square shows that 61% of samples that mention keyword wom[ae]n pass the CLIP filter.
  • Figure 4: Common words with widest pass rate differences by gender. The top graph plots the top 20 words with the largest pass rate difference between mentions of women keywords versus mentions of men keywords and hence are more "woman-associated." The bottom graph plots the top 20 words with the largest pass rate difference between men and women and hence are more "man-associated."
  • Figure 5: Frequency for samples detected by Rekognition as containing one face, split by Rekognition-detected age and gender groups. For each imputed group the pass rate is included in parentheses. We see that older detected ages in the Female-imputed group have substantially lower pass rates and lower representation than the corresponding Male-imputed group, and that younger detected ages in the Male-imputed group have lower pass rates and lower representation than the corresponding Female-imputed group.
  • ...and 14 more figures
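
The filtering rule in the Figure 1 caption and the pass rate in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' code. The two-dimensional vectors stand in for the CLIP embeddings $\theta_f(x_i)$ and $\theta_f(y_i)$, and the threshold value, the group labels, and the function names `cosine_similarity`, `clip_filter`, and `pass_rates` are all hypothetical, chosen only to make the example self-contained.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def passes(sample, t_filter):
    """A pair is kept when its image-text similarity is above the threshold."""
    return cosine_similarity(sample["image_emb"], sample["text_emb"]) > t_filter


def clip_filter(raw_dataset, t_filter):
    """Build the filtered dataset D_f from the raw dataset D_r."""
    return [s for s in raw_dataset if passes(s, t_filter)]


def pass_rates(raw_dataset, t_filter):
    """Fraction of each group's raw samples that end up in the filtered dataset."""
    totals = defaultdict(int)
    kept = defaultdict(int)
    for s in raw_dataset:
        totals[s["group"]] += 1
        if passes(s, t_filter):
            kept[s["group"]] += 1
    return {g: kept[g] / totals[g] for g in totals}


# Toy stand-ins for precomputed embeddings; real CLIP embeddings are much larger.
RAW = [
    {"group": "a", "image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]},
    {"group": "a", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
    {"group": "b", "image_emb": [1.0, 1.0], "text_emb": [1.0, 0.0]},
    {"group": "b", "image_emb": [0.0, 1.0], "text_emb": [1.0, 1.0]},
]
T_FILTER = 0.5  # hypothetical threshold, not a value taken from the paper

if __name__ == "__main__":
    print(len(clip_filter(RAW, T_FILTER)), "of", len(RAW), "samples kept")
    print(pass_rates(RAW, T_FILTER))
```

Comparing each group's pass rate with its share of the raw dataset is one way to look for the pattern the abstract calls exclusion amplification: a group that is both rare in $D_r$ and filtered out at a higher rate.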