Large Language Model Augmented Exercise Retrieval for Personalized Language Learning
Austin Xu, Will Monroe, Klinton Bicknell
TL;DR
This work tackles zero-shot exercise retrieval for learner-directed language learning, identifying a fundamental referential similarity gap between how learners describe learning objectives and the actual exercise content. It introduces mHyER, which combines multilingual contrastive pretraining with LLM-generated hypothetical retrieval candidates to bridge this gap and then runs nearest-neighbor search over a fixed exercise catalog. The authors create two novel benchmarks, DuoRD and Tatoeba Tags, and report that mHyER substantially outperforms strong baselines on both, with ablations indicating that contrastive training and candidate synthesis provide complementary benefits. The approach gives learners explicit control over content, offering a practical path to more self-directed, personalized language learning at scale.
Abstract
We study the problem of zero-shot exercise retrieval in the context of online language learning, to give learners the ability to explicitly request personalized exercises via natural language. Using real-world data collected from language learners, we observe that vector similarity approaches poorly capture the relationship between exercise content and the language that learners use to express what they want to learn. This semantic gap between queries and content dramatically reduces the effectiveness of general-purpose retrieval models pretrained on large-scale information retrieval datasets like MS MARCO. We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the learner's input, which are then used to search for relevant exercises. Our approach, which we call mHyER, overcomes three challenges: (1) lack of relevance labels for training, (2) unrestricted learner input content, and (3) low semantic similarity between input and retrieval candidates. mHyER outperforms several strong baselines on two novel benchmarks created from crowdsourced data and publicly available data.
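The retrieval idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a multilingual sentence encoder, `fake_llm` is a hypothetical stub that returns canned sentences in place of a real LLM call, and pooling the hypothetical exercises into a single query vector is an assumption made for brevity. The catalog and request are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: bag of lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def top_k(query_vec: Counter, catalog: List[str], k: int) -> List[str]:
    """Nearest-neighbor search over a fixed catalog of exercises."""
    ranked = sorted(catalog, key=lambda ex: cosine(query_vec, embed(ex)), reverse=True)
    return ranked[:k]

def retrieve_with_hypotheticals(
    request: str,
    catalog: List[str],
    generate: Callable[[str], List[str]],
    k: int = 2,
) -> List[str]:
    """Search with LLM-written hypothetical exercises instead of the raw request."""
    pooled = Counter()
    for hypothetical in generate(request):
        # Assumption: pool hypothetical exercises by summing their vectors.
        pooled.update(embed(hypothetical))
    return top_k(pooled, catalog, k)

def fake_llm(request: str) -> List[str]:
    """Hypothetical stub; a real system would prompt an LLM with the request."""
    return ["I would like the soup, please.", "Can I see the menu?"]

CATALOG = [
    "I would like a coffee, please.",
    "The train leaves at noon.",
    "Can we have the menu?",
    "My sister is a doctor.",
]
REQUEST = "ordering food at a restaurant"

direct = top_k(embed(REQUEST), CATALOG, 2)
augmented = retrieve_with_hypotheticals(REQUEST, CATALOG, fake_llm)
```

In this toy setup the learner's request shares almost no vocabulary with the relevant exercises, so searching with the request alone surfaces unrelated sentences, while searching with the hypothetical exercises surfaces the restaurant-related ones. A real encoder would behave differently in detail, but the sketch shows why the search is run in exercise space instead of request space.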
