Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations

Megan A. Brown, Andrew Gruen, Gabe Maldoff, Solomon Messing, Zeve Sanderson, Michael Zimmer

TL;DR

This paper addresses the lack of clear guidance on web scraping in social science by presenting a framework that covers the legal, ethical, institutional, and scientific dimensions of scraping for U.S.-based researchers. It analyzes how contractual terms, statutory regimes, and privacy laws interact with data-access regimes such as the GDPR and the EU’s DSA, and it discusses operational realities such as undocumented APIs and data collection through browser plugins. The authors offer pragmatic recommendations, including data minimization, privacy impact assessments, early IRB engagement, and technical risk management, along with a checklist that researchers can use to plan and communicate about scraping projects. By emphasizing sampling rigor, transparency, and multi-stakeholder coordination, the work aims to balance the benefits of web-based data with protections for individuals and institutions, while promoting standards and training that improve the scientific validity of scraping-derived findings.

Abstract

Scientists across disciplines often use data from the internet to conduct research, generating valuable insights about human behavior. However, as generative AI relying on massive text corpora becomes increasingly valuable, platforms have greatly restricted access to data through official channels. As a result, researchers will likely turn to web scraping more often to collect data, which introduces new challenges and concerns. This paper proposes a comprehensive framework for web scraping in social science research for U.S.-based researchers, examining the legal, ethical, institutional, and scientific factors that researchers should consider when scraping the web. We present an overview of the current regulatory environment affecting when and how researchers can access, collect, store, and share data via scraping. We then provide researchers with recommendations for conducting scraping in a scientifically legitimate and ethical manner. We aim to equip researchers with the information they need to mitigate risks and maximize the impact of their research in this evolving data access landscape.

Paper Structure

This paper contains 44 sections.