Chat-Based Support Alone May Not Be Enough: Comparing Conversational and Embedded LLM Feedback for Mathematical Proof Learning

Eason Chen; Sophia Judicke; Kayla Beigh; Xinyi Tang; Isabel Wang; Nina Yuan; Zimo Xiao; Chuangji Li; Shizhuo Li; Reed Luttmer; Shreya Singh; Maria Yampolsky; Naman Parikh; Yvonne Zhao; Meiyi Chen; Scarlett Huang; Anishka Mohanty; Gregory Johnson; John Mackey; Jionghao Lin; Ken Koedinger

TL;DR

Chatbot-based support alone may not reliably support transfer to independent assessments of mathematical proof learning, whereas work-anchored, structured feedback appears less associated with reduced learning.

Abstract

We evaluate GPTutor, an LLM-powered tutoring system for an undergraduate discrete mathematics course. GPTutor integrates two LLM-supported tools: a structured proof-review tool that provides embedded feedback on students' written proof attempts, and a chatbot for math questions. In a staggered-access study with 148 students, earlier access was associated with higher homework performance during the interval when only the experimental group could use the system, but we did not observe this increase transferring to exam scores. Usage logs show that students with lower self-efficacy and lower prior exam performance used both components more frequently. Session-level behavioral labels, produced by human coding and scaled with an automated classifier, characterize how students engaged with the chatbot (e.g., answer-seeking or help-seeking). In models controlling for prior performance and self-efficacy, higher chatbot usage and answer-seeking behavior were negatively associated with subsequent midterm performance, whereas proof-review usage showed no detectable independent association. Together, the findings suggest that chatbot-based support alone may not reliably support transfer to independent assessments of mathematical proof learning, whereas work-anchored, structured feedback appears less associated with reduced learning.

Paper Structure (23 sections, 4 figures)

Introduction
Related Work
LLM-Based Educational Support and Overreliance
AI Support for Proof Writing
Self-Efficacy and Help-Seeking in AI-Supported Learning
System Developments
Proof-Review-GPTutor to Guide Students' Open-Ended Proof Writing
AI Chatbot for Answering Students' Questions
Methods
Participants and Study Design
GPTutor Use in the Course
Measures and Data Sources
Analysis Strategy
Results
Access-level performance patterns: homework improves but exams do not
...and 8 more sections

Figures (4)

Figure 1: The interface of Proof-Review-GPTutor, where it reviews student homework proof assignments.
Figure 2: Screenshot of the chatbot interface illustrating answer-seeking and escalation behaviors. Left: A student requests an explanation, and the chatbot responds with step-by-step guidance and a reflection prompt. Right: Despite receiving pedagogical scaffolding, the student continues asking for the answer to the same problem, illustrating escalation behavior where students bypass guided reasoning to seek direct solutions.
Figure 3: Normalized homework scores by condition over time. During the period before Midterm 2, when only the experimental group had access to GPTutor, the experimental group achieved higher homework scores on average.
Figure 4: Serial statistical mediation model showing associations among self-efficacy, prior performance (Midterm 2), GPTutor usage, and Midterm 3 performance. Values on the arrows represent standardized coefficients ($\beta$) and their corresponding $p$-values. Usage frequencies were measured during the Midterm 2$\rightarrow$Midterm 3 interval.
