Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors

Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai

TL;DR

Task-tutoring with LLMs faces grounding and personalization challenges in real-world tasks. The authors introduce Trace-and-Verify (Traver), a workflow that combines explicit knowledge tracing with a turn-by-turn verifier to guide tutor utterances toward task completion, and the Dict automatic evaluation protocol that uses simulated students and automated unit tests for scalable benchmarking. Empirical results on EvoCodeBench show Traver improves tutoring outcomes over vanilla baselines, narrows the gap to an Oracle, and supports inference-time scaling by evaluating multiple candidate utterances per turn. The work lays a path toward scalable, task-focused tutoring beyond coding and highlights opportunities for future human-in-the-loop validation and broader applications.

Abstract

Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.
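The turn-by-turn verification idea can be illustrated with a small sketch: at each turn, several candidate tutor utterances are scored against the estimated knowledge state, and the highest-scoring one is sent to the student. This is not the authors' implementation. The names `KnowledgeState`, `toy_verifier`, and `select_utterance` are hypothetical, and the keyword-matching score is only a stand-in for a trained verifier.

```python
from typing import Callable, Dict, Sequence

# Hypothetical knowledge state: knowledge component -> believed understood?
KnowledgeState = Dict[str, bool]


def toy_verifier(state: KnowledgeState, utterance: str) -> float:
    """Stand-in for a trained verifier (an assumption, not the paper's model).

    Returns the fraction of not-yet-understood components that the
    utterance mentions, as a crude proxy for "progress toward completion".
    """
    missing = [kc for kc, known in state.items() if not known]
    if not missing:
        return 0.0
    text = utterance.lower()
    return sum(kc.lower() in text for kc in missing) / len(missing)


def select_utterance(
    state: KnowledgeState,
    candidates: Sequence[str],
    verifier: Callable[[KnowledgeState, str], float] = toy_verifier,
) -> str:
    """Pick the candidate utterance with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate utterance")
    return max(candidates, key=lambda u: verifier(state, u))


if __name__ == "__main__":
    state = {"recursion": True, "memoization": False}
    candidates = [
        "Let's review recursion again.",
        "Have you used memoization to cache results?",
    ]
    print(select_utterance(state, candidates))
```

Sampling more candidates per turn gives the verifier more options to choose from, which is the intuition behind the inference-time scaling mentioned in the TL;DR.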

Paper Structure

This paper contains 43 sections, 7 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: An illustration of coding tutoring, where a tutor aims to proactively guide students toward completing a target coding task while adapting to students' varying levels of background knowledge.
  • Figure 2: Traver with the trained verifier shows inference-time scaling for coding tutoring (detailed in the scaling analysis section). Left: Performance vs. sampled candidate utterances per turn. Right: Performance vs. total tokens consumed per tutoring session.
  • Figure 3: Overview of our work for developing coding tutoring agents. Left: The context of the coding tutoring problem. Middle: Trace-and-Verify (Traver) workflow. Right: Dict evaluation protocol.
  • Figure 4: Tutoring outcome curves (Pass rate) across various LLM-based tutors with Vanilla Instruct.
  • Figure 5: Comparison of tutoring outcome curves between TreeInstruct and Traver (Ours).
  • ...and 4 more figures