A Datalake for Data-driven Social Science Research
Puneet Arya, Ojas Sahasrabudhe, Adwaiya Srivastav, Partha Pratim Das, Maya Ramanath
TL;DR
The paper introduces a Datalake architecture tailored for interdisciplinary social science research. It addresses data fragmentation and weak reproducibility by embedding provenance, version control, and integrated analytics. It details a lifecycle-driven design with data search, ingestion, governance, and visualization capabilities, and demonstrates an end-to-end use case analyzing income, education, and infant mortality across US counties. The results illustrate how provenance and governance enable transparent, reproducible analyses while lowering technical barriers for non-experts. The work argues for broader access to advanced data practices among NGOs and students, with plans to extend functionality to ML pipelines, mobile access, and citizen data feedback.
Abstract
Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.
