In-class Data Analysis Replications: Teaching Students while Testing Science

Kristina Gligoric; Tiziano Piccardi; Jake Hofman; Robert West

TL;DR

Facing the reproducibility crisis, the paper tests in-class data analysis replications as a scalable pedagogy in EPFL's CS-401 course. It implements a five-step replication pipeline across 10 papers, with pre- and post-surveys used to test preregistered hypotheses. Results show high replication success (98% basic, 87% advanced) and time costs that exceed initial estimates, along with significant shifts in attitudes toward reproducibility; creative extensions further involve more advanced methods than unconstrained projects and show potential scientific impact. The work provides pragmatic guidance for educators on logistics, staffing, ethics, and grading, arguing that replication tasks can boost scientific reproducibility as a by-product of data science instruction.

Abstract

Science is facing a reproducibility crisis. Previous work has proposed incorporating data analysis replications into classrooms as a potential solution. However, despite the potential benefits, it is unclear whether this approach is feasible, and if so, what the involved stakeholders (students, educators, and scientists) should expect from it. Can students perform a data analysis replication over the course of a class? What are the costs and benefits for educators? And how can this solution help benchmark and improve the state of science? In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course (CS-401) taught at EPFL (N=354 students). Here we report pre-registered findings based on surveys administered throughout the course. First, we demonstrate that students can replicate previously published scientific papers, most of them qualitatively and some exactly. We find discrepancies between what students expect of data analysis replications and what they experience by doing them, along with changes in expectations about reproducibility, which together serve as evidence of attitude shifts that foster students' critical thinking. Second, we provide information for educators about how much overhead is needed to incorporate replications into the classroom and identify concerns that replications bring as compared to more traditional assignments. Third, we identify tangible benefits of the in-class data analysis replications for scientific communities, such as a collection of replication reports and insights about replication barriers in scientific work that should be avoided going forward. Overall, we demonstrate that incorporating replication tasks into a large data science class can increase the reproducibility of scientific work as a by-product of data science instruction, thus benefiting both science and students.

Paper Structure (20 sections, 10 figures, 5 tables)

Introduction
Methods
Study Design
Inclusion and Exclusion Criteria
Consent Statement and Information Sheet
Results: Data analysis replications
Preregistered Findings: Discrepancies Between Expectations and the Reality of Data Analysis Replication
Exploratory Findings: Understanding the Students' Experience
Results: Creative replication extensions
Considerations for educators
Logistics
Human resources
Added constraints
Ethical challenges
Grading
...and 5 more sections

Figures (10)

Figure 1: Study design summary. The timeline is visualized from the students' perspective. The semester progresses from the left to the right. The surveys were administered upon submission of the respective assignment step.
Figure 2: Data analysis types, between years. Histogram of the data analysis type across projects, in 2020, the year of creative replication extensions (blue), and 2021, the year of unconstrained projects (orange). Error bars mark bootstrapped 95% CI. Creative replication extensions are more technically advanced than unconstrained projects, as captured by a decreased use of less advanced descriptive methods (A), and an increased use of more advanced causal data analysis methods (D).
Figure A1:
Figure B1:
Figure A2:
...and 5 more figures
