A Datalake for Data-driven Social Science Research

Puneet Arya, Ojas Sahasrabudhe, Adwaiya Srivastav, Partha Pratim Das, Maya Ramanath

TL;DR

The paper introduces a Datalake architecture tailored for interdisciplinary social science research, addressing barriers of data fragmentation and reproducibility by embedding provenance, version control, and integrated analytics. It details a lifecycle-driven design with data search, ingestion, governance, and visualization capabilities, and demonstrates an end-to-end use case analyzing income, education, and infant mortality across US counties. The results illustrate how provenance and governance enable transparent, reproducible analyses while lowering technical barriers for non-experts. The work argues for broader access to advanced data practices among NGOs and students, with plans to extend functionality to ML pipelines, mobile access, and citizen data feedback.

Abstract

Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.

Paper Structure

This paper contains 13 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Data Visualization: Household Internet Access
  • Figure 2: Data Visualization: Trends in Women's Employment in the non-agriculture sector across countries
  • Figure 3: The typical lifecycle of datasets. The boxes highlight technical challenges associated with supporting the various stages of the lifecycle.
  • Figure 4: Data Lifecycle: Family Planning, Employment and Digital Access: Tables 1, 2, 3 are the ingested datasets. Table 4 results from a merge of Tables 1, 2 and 3. Table 5 is a result of selecting specific rows from Table 4 for analysis. The bar chart is derived from Table 5.
  • Figure 5: Components of the Datalake
  • ...and 5 more figures