A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Boyd-Graber

TL;DR

SMART is a mnemonic generator trained on feedback from real students learning new terms. Mnemonic experts assess it as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.

Abstract

Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior work generates mnemonics for students, but it does not train models on mnemonics that students prefer and that aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to learn which mnemonics students favor. We gather 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students think is helpful does not always capture what is truly helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4 at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
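
As a rough illustration of the alignment step described in the abstract, the sketch below orders two mnemonics by a scalar effectiveness score and evaluates the standard Direct Preference Optimization loss for that pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names `pair_from_scores` and `dpo_loss`, the `beta` value, and all numeric inputs are hypothetical, and the Bayesian model that would produce the scores is not shown.

```python
import math


def pair_from_scores(mnemonic_a, mnemonic_b, score_a, score_b):
    """Return (chosen, rejected), ordered by a scalar effectiveness score.

    Hypothetical helper: the scores stand in for the single effectiveness
    signal mentioned in the abstract. Exact ties fall back to mnemonic_a.
    """
    if score_a >= score_b:
        return mnemonic_a, mnemonic_b
    return mnemonic_b, mnemonic_a


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each argument is a sequence log-probability.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only (not taken from the paper).
chosen, rejected = pair_from_scores("mnemonic A", "mnemonic B", 0.2, 0.7)
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the chosen mnemonic relative to the reference.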

Paper Structure (46 sections, 8 equations, 10 figures, 14 tables)

Figures (10)

  • Figure 1: SMART overview. We fine-tune LLaMA-2 70B for the initial SMART model. We then collect three preference types: pairwise, rating, and learning. Finally, a Bayesian model synthesizes mnemonic effectiveness from all three preferences, and we use this signal to align SMART via Direct Preference Optimization.
  • Figure 2: Screenshot from our web-based flashcard app after a user is presented with a GRE vocabulary flashcard.
  • Figure 3: Screenshot of UI to collect Likert ratings.
  • Figure 4: Screenshot of UI for pairwise comparisons.
  • Figure 5: Correlation between user mnemonic ratings and turns needed for the same user to recall the term when studying with said mnemonic (jittered). Users cannot predict which mnemonics will best help them learn.
  • ...and 5 more figures