Language Models as Science Tutors

Alexis Chevalier; Jiayi Geng; Alexander Wettig; Howard Chen; Sebastian Mizera; Toni Annala; Max Jameson Aragon; Arturo Rodríguez Fanlo; Simon Frieder; Simon Machado; Akshara Prabhakar; Ellie Thieu; Jiachen T. Wang; Zirui Wang; Xindi Wu; Mengzhou Xia; Wenhan Xia; Jiatong Yu; Jun-Jie Zhu; Zhiyong Jason Ren; Sanjeev Arora; Danqi Chen

TL;DR

This work introduces TutorEval, a long-context, multi-disciplinary question-answering benchmark for assessing LMs as science tutors, and TutorChat, a dataset of 80,000 long synthetic dialogues built from open textbooks for training such tutors. It shows that fine-tuning base models on existing dialogue datasets leads to poor TutorEval performance, and it demonstrates the value of long-context, science-focused training data, combined in the MathMix mixture. The authors fine-tune Llemma models with 7B and 34B parameters and a 32K-token context window; these models achieve competitive TutorEval performance while performing strongly on GSM8K and MATH. By releasing the data, evaluations, and models, the paper provides a foundation for educational LMs that can process long scientific texts and answer students' questions about them.

Abstract

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.

Paper Structure (60 sections, 6 figures, 15 tables)

Introduction
Related Work
LMs for science
Fine-tuning on model-generated dialogue
LM as an evaluator
TutorEval
Dataset construction
Question collection
Dataset composition
Data categories
Data validation
LM as an Evaluator
Key points as ground-truths
LM-powered evaluation
Human-GPT-4 agreement
...and 45 more sections

Figures (6)

Figure 1: Example from TutorEval. Given the chapter, the student asks the LM Tutor a question. Both the chapter and the question are fed to the LM Tutor to generate the answer. GPT-4 assesses the generation by referencing the human-annotated key points (blue: the tutoring task; yellow: evaluation). See the appendix for detailed examples.
Figure 2: Left: performance breakdown on TutorEval by domains. Right: leaderboard of popular models on TutorEval. Our models, marked in bold, achieve competitive TutorEval performance.
Figure 3: We show the correlation between the scores from 17 annotators and the GPT-4 scores for four models: Vicuna-13B-16K, Llemma-7B-32K-Ultrachat, Llemma-7B-32K-MathMix, and GPT-4. Each annotator evaluates these models on their own set of 50 questions.
Figure 4: TutorEval results for fine-tuning Llemma-7B-32K with various subsets of TutorChat-STEM. Each subset contains 10K samples. See the data ablations table for more results.
Figure 5: Combined performance on TutorEval and math-oriented datasets (average of GSM8K and MATH). In red are our models trained with MathMix, with 7B and 34B parameters. In purple are 7B-parameter baselines trained from Llemma-7B-32K. We also include the pre-trained Mistral-7B-V2 in green.
...and 1 more figure
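
The evaluation flow described in the Figure 1 caption (chapter and question go to the LM tutor, then a grader model scores the answer against human-annotated key points) can be sketched roughly as below. This is a minimal illustration only: the function names, the prompt wording, the `SCORE:` reply format, and the 0 to 3 scale are all assumptions made for the example and are not taken from the paper. The actual model calls are left out.

```python
import re
from typing import List, Optional


def build_tutor_prompt(chapter: str, question: str) -> str:
    """Combine the textbook chapter and the student's question into one tutor prompt.

    The wording is hypothetical; only the inputs (chapter + question) follow Figure 1.
    """
    return (
        "You are a science tutor. Read the chapter, then answer the question.\n\n"
        f"CHAPTER:\n{chapter}\n\n"
        f"QUESTION:\n{question}\n\n"
        "ANSWER:"
    )


def build_grader_prompt(
    question: str, key_points: List[str], answer: str, max_score: int = 3
) -> str:
    """Ask a grader model to score an answer against human-written key points.

    The 0..max_score scale and the 'SCORE:' reply format are assumptions.
    """
    points = "\n".join(f"- {p}" for p in key_points)
    return (
        "Grade the tutor's answer using the key points as ground truth.\n\n"
        f"QUESTION:\n{question}\n\n"
        f"KEY POINTS:\n{points}\n\n"
        f"TUTOR ANSWER:\n{answer}\n\n"
        f"Reply with 'SCORE: <integer from 0 to {max_score}>'."
    )


def parse_score(grader_reply: str, max_score: int = 3) -> Optional[int]:
    """Extract the integer score from the grader's reply, or None if missing or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", grader_reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= max_score else None
```

In use, the string from `build_tutor_prompt` would be sent to the model under test, its answer passed to `build_grader_prompt`, and the grader's reply passed to `parse_score`. Returning `None` for unparseable replies keeps malformed grader outputs from being silently counted as zero.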
