Validating LLM-Generated Relevance Labels for Educational Resource Search
Ratan J. Sebastian, Anett Hoppe
TL;DR
The paper addresses the challenge of obtaining domain-specific relevance judgements for educational resources by evaluating how LLMs perform when prompted with education-aligned criteria. It compares three prompt frameworks (baseline, literature-derived, and participant-derived) across two input representations (HEAD and SKIM), using ground-truth data from a teacher-focused user study and metrics such as Cohen's $\kappa$ and Rank-Biased Overlap ($\text{RBO}$). The results show that domain-specific prompts substantially improve alignment with human judgements: the participant-derived framework achieves strong agreement ($\kappa$ up to $0.639$), and the choice of input representation also affects performance. System-level evaluation reveals that LLMs can reliably identify hard queries ($\text{RBO}$ $0.713$-$0.886$) but discriminate between similar retrieval systems more modestly ($\text{RBO}$ $0.52$-$0.56$), and that performance differs between proprietary and open-source models. Overall, the findings support using domain-informed prompting to scale the evaluation of educational resources, while noting context- and model-dependent limitations and the potential need for open benchmarks.
Abstract
Manual relevance judgements in Information Retrieval are costly and require expertise, driving interest in using Large Language Models (LLMs) for automatic assessment. While LLMs have shown promise in general web search scenarios, their effectiveness for evaluating domain-specific search results, such as educational resources, remains unexplored. To investigate different ways of including domain-specific criteria in LLM prompts for relevance judgement, we collected and released a dataset of 401 human relevance judgements from a user study involving teaching professionals performing search tasks related to lesson planning. We compared three approaches to structuring these prompts: a simple two-aspect evaluation baseline from prior work on using LLMs as relevance judges, a comprehensive 12-dimensional rubric derived from educational literature, and criteria directly informed by the study participants. Using domain-specific frameworks, LLMs achieved strong agreement with human judgements (Cohen's $\kappa$ up to 0.650), significantly outperforming the baseline approach. The participant-derived framework proved particularly robust, with GPT-3.5 achieving $\kappa$ scores of 0.639 and 0.613 for the 10-dimension and 5-dimension versions, respectively. System-level evaluation showed that LLM judgements reliably identified top-performing retrieval approaches (RBO scores 0.71-0.76) while maintaining reasonable discrimination between systems (RBO 0.52-0.56). These findings suggest that LLMs can effectively evaluate educational resources when prompted with domain-specific criteria, though performance varies with framework complexity and input structure.
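For readers unfamiliar with the two agreement measures named above, the following is a minimal Python sketch of how they can be computed. It is illustrative only and is not the authors' evaluation code. The function names, the toy labels, and the persistence parameter `p = 0.9` are assumptions. The RBO shown is the simple truncated (non-extrapolated) form, which may differ from the variant used in the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both judges.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each judge's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def rbo_truncated(ranking_s, ranking_t, p=0.9):
    """Truncated Rank-Biased Overlap of two rankings without duplicate items.

    Sums the weighted overlap of the top-d prefixes up to the shorter list's
    length. This is a lower bound on full RBO: two identical lists of length k
    score 1 - p**k rather than 1.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    depth = min(len(ranking_s), len(ranking_t))
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(ranking_s[d - 1])
        seen_t.add(ranking_t[d - 1])
        overlap = len(seen_s & seen_t)  # items shared by both top-d prefixes
        score += p ** (d - 1) * overlap / d
    return (1.0 - p) * score


# Hypothetical toy data: binary relevance labels and two system rankings.
human = [1, 1, 0, 0]
llm = [1, 0, 0, 0]
kappa = cohens_kappa(human, llm)
rbo = rbo_truncated(["bm25", "dense", "hybrid"], ["bm25", "hybrid", "dense"])
```

In this reading, $\kappa$ would compare LLM labels with human labels per judged resource, while RBO would compare the ranking of queries or retrieval systems induced by LLM labels with the ranking induced by human labels. Smaller values of `p` concentrate the weight on the top ranks.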
