TeachLM: Post-Training LLMs for Education Using Authentic Learning Data

Janos Perczel, Jin Chow, Dorottya Demszky

TL;DR

This work addresses the gap between educational pedagogy and large language models by leveraging authentic student–tutor data to post-train LLMs. It introduces TeachLM, a parameter-efficiently fine-tuned model trained on over 100,000 hours of Polygence interactions and paired with a novel multi-turn evaluation framework that uses a fine-tuned student model to generate scalable synthetic dialogues. The results show that fine-tuning on authentic learning data improves conversational and pedagogical performance, increasing student talk time, refining questioning style, extending dialogue depth, and personalizing instruction, compared with off-the-shelf models. The study validates a scalable, reproducible evaluation workflow and outlines future directions, including RLHF and more nuanced longitudinal benchmarks, to advance AI-assisted education.

Abstract

The promise of generative AI to revolutionize education is constrained by the pedagogical limits of large language models (LLMs). A major issue is the lack of access to high-quality training data that reflect the learning of actual students. Prompt engineering has emerged as a stopgap, but the ability of prompts to encode complex pedagogical strategies in rule-based natural language is inherently limited. To address this gap, we introduce TeachLM, an LLM optimized for teaching through parameter-efficient fine-tuning of state-of-the-art models. TeachLM is trained on a dataset comprising 100,000 hours of one-on-one, longitudinal student-tutor interactions maintained by Polygence, which underwent a rigorous anonymization process to protect privacy. We use parameter-efficient fine-tuning to develop an authentic student model that enables the generation of high-fidelity synthetic student-tutor dialogues. Building on this capability, we propose a novel multi-turn evaluation protocol that leverages synthetic dialogue generation to provide fast, scalable, and reproducible assessments of the dialogical capabilities of LLMs. Our evaluations demonstrate that fine-tuning on authentic learning data significantly improves conversational and pedagogical performance: doubling student talk time, improving questioning style, increasing dialogue turns by 50%, and yielding greater personalization of instruction.

Paper Structure

This paper contains 27 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Left: Illustration of the Polygence program and project outcomes. Students meet online with tutors pursuing or holding advanced degrees (PhD, MD, JD, MBA, etc.) to take projects from ideation to execution to showcasing. Project outcomes range from academic papers to creating podcasts to engineering physical devices. Topics range from AI to cancer biology to sport analytics. Right: Top 10 institutions represented by the advanced degrees pursued or held by Polygence tutors.
  • Figure 2: A squarified hierarchical map of the distribution of project topics based on a random sample of $n=1,000$ Polygence projects. To create this map, we used a customized version of K-means clustering of project topics based on Anthropic's Clio framework tamkin2024clio and the open-source Kura library 567labs2025kura. The size of each box is proportional to the relative frequency of each topic or topic cluster.
  • Figure 3: Tutoring activity overview for $n=195$ completed 10-session projects. Each 1-hour session is segmented into 5-minute chunks, analyzed individually, and hierarchically clustered into 4 levels using Anthropic's Clio framework tamkin2024clio and a customized version of the open-source Kura library 567labs2025kura. The distribution of the 7 top-level tutoring categories is shown across sessions. We find a 78% overlap with the top 10 student usage categories of ChatGPT reported by OpenAI openai2025collegechatgpt.
  • Figure 4: End-to-end transcript processing pipeline. Dual-track audio is merged and trimmed, then transcribed with high fidelity. Speaker activity masks enable accurate diarization, followed by a multi-step cleaning process (fix punctuation and grammar, remove backchannels and interruptions, align context and tutor persona, and improve coherence and enforce turn taking) to yield polished tutor–student transcripts.
  • Figure 5: Comparing three core conversational statistics (talk time, questions per turn, words per turn) across four different types of dialogues: human to human, human to GPT-4, base student model (Gemini 2.0 Flash) to GPT-4, and tuned student model (Gemini 2.0 Flash tuned on Polygence student data) to GPT-4. The human-to-GPT-4 conversational data was obtained from our PolyPilot experiment (Section \ref{section:polypilot}). We observe that simulated conversations between two prompt-engineered base models (large orange dot) produce significantly different ($p<0.001$) conversational statistics from both dialogues involving only humans (green dashed line) and dialogues between a human student and AI (red dotted line). Fine-tuning a model on student data (connected blue dots) progressively aligns its conversational statistics with those of actual humans conversing with AI (red dotted line). Humans interacting with AI also produce different conversational statistics than human-to-human conversations, highlighting that the prompt-engineered base tutor model (GPT-4) impersonates a human tutor with limited fidelity. These results provide further evidence that simulated dialogues involving a fine-tuned student model approximate human-AI conversations better than conversations generated from two prompt-engineered AI models alone. Error bars and intervals represent $95\%$ confidence intervals calculated with the Student’s t-distribution (light green and light red intervals show the error bars for human-to-human and human-to-GPT-4 conversations respectively).
  • ...and 4 more figures
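
The three conversational statistics named in the Figure 5 caption (talk time, questions per turn, words per turn) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` type and the `conversational_stats` function are hypothetical names, talk time is approximated here by the student's share of words (the paper may measure it differently), and questions are counted naively as `?` characters in tutor turns.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # assumed labels: "student" or "tutor"
    text: str


def conversational_stats(turns):
    """Compute simple per-dialogue statistics from a list of turns.

    Assumptions (not from the paper): talk time is proxied by word share,
    and a question is any '?' character in a tutor turn.
    """
    words = {"student": 0, "tutor": 0}
    tutor_turns = 0
    tutor_questions = 0
    for turn in turns:
        n_words = len(turn.text.split())
        words[turn.speaker] = words.get(turn.speaker, 0) + n_words
        if turn.speaker == "tutor":
            tutor_turns += 1
            tutor_questions += turn.text.count("?")
    total_words = words["student"] + words["tutor"]
    return {
        # Fraction of all words spoken by the student (talk-time proxy).
        "student_talk_share": words["student"] / total_words if total_words else 0.0,
        # Average number of question marks per tutor turn.
        "tutor_questions_per_turn": tutor_questions / tutor_turns if tutor_turns else 0.0,
        # Average number of words per tutor turn.
        "tutor_words_per_turn": words["tutor"] / tutor_turns if tutor_turns else 0.0,
    }


# Toy dialogue, invented purely for illustration.
dialogue = [
    Turn("tutor", "What are you curious about? Why that topic?"),
    Turn("student", "I want to study black holes."),
    Turn("tutor", "Great, tell me more."),
    Turn("student", "They bend light and even time."),
]
stats = conversational_stats(dialogue)
print(stats)
```

Averaging such per-dialogue values over many conversations, and attaching confidence intervals as described in the caption, would be the next step; that part is omitted here because the paper's exact procedure is not given in this section.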