Not Everyone Wins with LLMs: Behavioral Patterns and Pedagogical Implications for AI Literacy in Programmatic Data Science
Qianou Ma, Kenneth Koedinger, Tongshuang Wu
TL;DR
The paper investigates whether large language models (LLMs) democratize programmatic data science and finds that experience matters: under time pressure, LLMs can close performance gaps for less-experienced students, but with more time, technical background still predicts success. Using a mixed-methods classroom study with logs, surveys, and think-aloud data, the authors develop an LLM-assisted log annotation codebook to characterize AI-use behaviors across episodes and four knowledge dimensions. They show that technically experienced students use AI more strategically (clear prompts, planning, and explanation), while novices rely on AI for immediate debugging. Demonstrations and longer task time improve some AI-use skills, but evaluative skills require targeted training. The work contributes empirical insights into AI literacy as a set of transferable competencies, along with practical guidance for curricula and tool design to support durable, effective human–AI collaboration in data science. Overall, the study highlights that successful AI-enabled data analysis hinges on structured training that fosters metacognitive, conceptual, procedural, and dispositional AI-use skills, not merely surface familiarity with AI tools.
Abstract
LLMs promise to democratize technical work in complex domains like programmatic data analysis, but not everyone benefits equally. We study how students with varied experience use LLMs to complete Python-based data analysis in computational notebooks in a graduate course. Drawing on homework logs, recordings, and surveys from 36 students, we ask: Which experience matters most, and how does it shape AI use? Our mixed-methods analysis shows that technical experience -- not AI familiarity or communication skills -- remains a significant predictor of success. Students also vary widely in how they leverage LLMs, struggling at the stages of forming intent, expressing inputs, interpreting outputs, and assessing results. We identify success and failure behaviors, such as providing context or decomposing prompts, that distinguish effective use. These findings inform AI literacy interventions, highlighting that lightweight demonstrations improve surface fluency but are insufficient; deeper training and scaffolds are needed to cultivate resilient AI-use skills.
