
From Tool to Teammate: LLM Coding Agents as Collaborative Partners for Behavioral Labeling in Educational Dialogue Analysis

Eason Chen, Isabel Wang, Nina Yuan, Sophia Judicke, Kayla Beigh, Xinyi Tang

Abstract

Behavioral analysis of tutoring dialogues is essential for understanding student learning, yet manual coding remains a bottleneck. We present a methodology in which LLM coding agents autonomously improve the prompts used by LLM classifiers to label educational dialogues. In each iteration, a coding agent runs the classifier against human-labeled validation data, analyzes disagreements, and proposes theory-grounded prompt modifications for researcher review. We applied this approach to 659 AI tutoring sessions across four experiments with three agents and three classifiers; 4-fold cross-validation on held-out data confirmed genuine improvement: the best agent achieved test $\kappa=0.78$ (SD$=0.08$), matching human inter-rater reliability ($\kappa=0.78$), at a cost of approximately \$5--8 per agent. While development-set performance reached $\kappa=0.91$--$0.93$, the cross-validated results represent our primary generalization claim. The iterative process also surfaced an undocumented labeling pattern: human coders consistently treated expressions of confusion as engagement rather than disengagement. Continued iteration beyond the optimum led to regression, underscoring the need for held-out validation. We release all prompts, iteration logs, and data.
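The iteration cycle described above can be summarized as a simple loop: run the classifier with the current prompt, score agreement against human labels with Cohen's kappa, and hand the disagreements to the coding agent to propose a revised prompt. The sketch below is illustrative only; the helpers `run_classifier` and `propose_prompt_edit` are hypothetical placeholders (not the authors' tooling), the stopping rule is an assumption, and the human review step is omitted. Only `cohen_kappa_score` is a real library call.

```python
# Minimal sketch of the agent-in-the-loop prompt refinement cycle (illustrative;
# run_classifier and propose_prompt_edit are hypothetical placeholders).
from sklearn.metrics import cohen_kappa_score

def refine_prompt(prompt, dialogues, human_labels, max_iters=10, target_kappa=0.78):
    history = []
    for version in range(max_iters):
        # 1. Run the LLM classifier with the current prompt over the validation set.
        predicted = [run_classifier(prompt, d) for d in dialogues]  # hypothetical helper
        kappa = cohen_kappa_score(human_labels, predicted)
        history.append((version, kappa, prompt))

        # 2. Collect disagreements for the coding agent to analyze.
        disagreements = [
            (d, h, p) for d, h, p in zip(dialogues, human_labels, predicted) if h != p
        ]
        if kappa >= target_kappa or not disagreements:
            break

        # 3. The agent proposes a theory-grounded prompt revision; in the paper's
        #    workflow a researcher reviews it before the next iteration.
        prompt = propose_prompt_edit(prompt, disagreements)  # hypothetical helper

    # Return the best-scoring prompt version rather than the last one, since
    # continued iteration beyond the optimum can regress.
    return max(history, key=lambda rec: rec[1])
```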

Paper Structure

This paper contains 48 sections, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Screenshot of the chatbot interface illustrating answer-seeking and escalation behaviors. Left: A student requests an explanation, and the chatbot responds with step-by-step guidance and a reflection prompt. Right: Despite receiving pedagogical scaffolding, the student continues asking for the answer to the same problem, illustrating escalation behavior where students bypass guided reasoning to seek direct solutions.
  • Figure 2: Overall $\kappa$ progression across iterations. Codex and Gemini both reached $\kappa=0.93$. Claude Code reached $\kappa=0.91$ at v5. Continued iteration can lead to regression. Dashed line shows human baseline ($\kappa=0.89$).
  • Figure 3: Final performance by label dimension. Gemini achieved highest Follow-up Type (0.87), while Codex led on Topic Type (0.85).
  • Figure 4: Follow-up Type progression. All agents started at $\kappa \approx 0.45$--$0.55$. Codex achieved a breakthrough at v7 (0.83), and Gemini reached $\kappa=0.87$ at v9. Claude Code peaked early at v5 (0.88).
  • Figure 5: Iteration dynamics across classifier models. Left: Claude Opus 4.5 classifier. Right: GPT-5.2 classifier. Codex achieves $\kappa=0.93$ with both classifiers. Gemini peaks at $0.93$ with GPT but only $0.85$ with Claude.
  • ...and 1 more figure