Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds

Rawisara Lohanimit, Yankun Wu, Amelia Katirai, Yuta Nakashima, Noa Garcia

TL;DR

This paper investigates privacy risks in large-scale, publicly scraped image datasets through a focused audit of pregnancy ultrasound images in LAION-400M. It detects ultrasound images along two paths, CLIP-based retrieval and a classifier trained on the PIU dataset, and then extracts embedded private information using OCR with text correction and Presidio for entity extraction. The study identifies 833 unique pregnancy ultrasound images in LAION-400M and finds 677 instances of private information across them, including co-occurrences of several entity types in the same image that raise the risk of re-identification. Clustering the images by theme (e.g., baby-related announcements and keepsakes) shows that some themes often carry comprehensive personal data, underscoring the need for privacy-preserving data collection, consent, and de-identification in open image datasets. The authors propose concrete recommendations for dataset curation, privacy-preserving training, and ethical governance to mitigate misuse and protect individuals' privacy in reproductive health data.

Abstract

The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasounds and detect thousands of entities of private information such as names and locations. Our findings reveal that multiple images contain high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
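
To make the retrieval step concrete, the following is a minimal sketch of similarity-based retrieval over embeddings. It is not the authors' implementation: the three-dimensional vectors stand in for real CLIP embeddings, and the image ids, the `retrieve` helper, and the 0.8 threshold are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, threshold):
    """Return (image_id, score) pairs scoring at or above the threshold, best first."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates.items()]
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings; real CLIP embeddings have hundreds of dimensions.
query_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}

# The threshold is an assumed value, not one reported in the paper.
hits = retrieve(query_embedding, candidate_embeddings, threshold=0.8)
print(hits)
```

In practice the query embedding would come from an image or text query, and the candidates would be the precomputed embeddings of the dataset images. The threshold trades recall against the amount of manual review needed afterwards.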

Paper Structure (24 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Pregnancy ultrasound image detection. For retrieval-based detection, we use image and text as queries to find images that have high similarity with the query. For classifier-based detection, we use the output of the classifier as the prediction.
  • Figure 2: Examples of (a) positive images and (b) negative images in the PIU dataset. Faces and private information redacted for privacy.
  • Figure 3: Private information identification. Detected pregnancy ultrasound images are 1) preprocessed with super-resolution and rotation for horizontal text alignment, 2) processed for text recognition and correction, and 3) passed to a private information detection system to extract Name, Location, Date Time, and Phone Number entities.
  • Figure 4: t-SNE visualization of the pregnancy ultrasound images found in LAION-400M. Colors represent each of the cluster themes (names shown next to each cluster) found with HDBSCAN. Faces and private information redacted for privacy.
  • Figure 5: Number of private information entities detected in each pregnancy ultrasound image within LAION-400M.
  • ...and 1 more figure
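
The per-image counts in Figure 5 and the co-occurrence of entity types can be tallied with a few lines of code once entities have been extracted. The sketch below assumes detections are already available as a mapping from image id to entity labels. The image ids, the data, and the label strings are hypothetical; the labels only mirror the entity types named in Figure 3.

```python
from collections import Counter
from itertools import combinations

# Hypothetical detections: image id -> entity types found in that image.
detections = {
    "img_001": ["NAME", "LOCATION", "DATE_TIME"],
    "img_002": ["NAME", "DATE_TIME"],
    "img_003": ["DATE_TIME"],
    "img_004": [],
}


def entities_per_image(dets):
    """Count how many entities were detected in each image."""
    return {image_id: len(entities) for image_id, entities in dets.items()}


def cooccurrence_counts(dets):
    """Count the images in which each pair of distinct entity types appears together."""
    counts = Counter()
    for entities in dets.values():
        # Sort the distinct types so each pair has one canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


per_image = entities_per_image(detections)
pairs = cooccurrence_counts(detections)
print(per_image)
print(pairs.most_common())
```

Pairs that appear together often, such as a name with a date, are the combinations most relevant to re-identification risk, because each added entity type narrows down who the image could belong to.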