
Domain-Grounded Evaluation of LLMs in International Student Knowledge

Claudinei Daitx, Haitham Amar

TL;DR

This work addresses the risk and reliability of large language models in study-abroad advising, where answers must integrate cross-domain policies (visas, admissions, scholarships) and stay grounded in up-to-date references. It introduces a domain-grounded evaluation protocol that jointly measures accuracy and hallucination, using an ApplyBoard-derived dataset, a partial-credit rubric, and metrics for domain routing, evidence alignment, and claim-level support. The study compares multiple LLM families under shared prompts, reporting factual correctness, answer accuracy, faithfulness, answer relevancy, and ANAH v2 scores to characterize model behavior. The findings reveal distinct model profiles and yield actionable deployment guidance: use conservative, reference-bound configurations for policy-sensitive tasks, and higher-relevance, guarded configurations for exploratory queries. Together these support safer, more reliable educational advising at scale.

Abstract

Large language models (LLMs) are increasingly used to answer high-stakes study-abroad questions about admissions, visas, scholarships, and eligibility. Yet it remains unclear how reliably they advise students, and how often otherwise helpful answers drift into unsupported claims ("hallucinations"). This work provides a clear, domain-grounded overview of how current LLMs behave in this setting. Using a realistic question set drawn from the advising workflows of ApplyBoard, an EdTech platform that supports students from discovery to enrolment, we evaluate two essentials side by side: accuracy (is the information correct and complete?) and hallucination (does the model add content not supported by the question or domain evidence?). Each question is categorized by domain scope: single-domain, or multi-domain when the answer must integrate evidence across areas such as admissions, visas, and scholarships. To reflect real advising quality, we grade each answer with a simple three-level rubric: correct, partial, or wrong. The rubric is domain-coverage-aware: an answer is partial if it addresses only a subset of the required domains, and over-scoped if it introduces extra, unnecessary domains. Our scoring captures the first pattern as under-coverage and the second as reduced relevance or hallucination. We also report measures of faithfulness and answer relevance, alongside an aggregate hallucination score, to capture relevance and usefulness. All models are tested with the same questions for a fair, head-to-head comparison. Our goals are to: (1) give a clear picture of which models are most dependable for study-abroad advising, (2) surface common failure modes, where answers are incomplete, off-topic, or unsupported, and (3) offer a practical, reusable protocol for auditing LLMs before deployment in education and advising contexts.
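
To make the domain-coverage-aware rubric concrete, the following is a minimal sketch of how a correct / partial / wrong grade could be derived from the set of domains a question requires and the set an answer addresses. It is an illustration under stated assumptions, not the authors' implementation: the names grade_coverage, CoverageResult, and mean_score, the facts_correct flag, and the numeric weights (1.0 / 0.5 / 0.0) are all hypothetical and do not appear in the paper.

```python
from dataclasses import dataclass

# Hypothetical partial-credit weights; the abstract names the three grades
# but does not give numeric values for them.
GRADE_WEIGHTS = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}


@dataclass(frozen=True)
class CoverageResult:
    grade: str            # "correct", "partial", or "wrong"
    missing: frozenset    # required domains the answer did not address (under-coverage)
    extra: frozenset      # unnecessary domains the answer introduced (over-scope)


def grade_coverage(required, addressed, facts_correct):
    """Grade one answer from its domain coverage (illustrative only).

    required      -- domains the question needs, e.g. {"visas", "admissions"}
    addressed     -- domains the answer actually covers
    facts_correct -- assumed to come from a separate factual check
    """
    required, addressed = frozenset(required), frozenset(addressed)
    missing = required - addressed
    extra = addressed - required
    if not facts_correct or not (required & addressed):
        grade = "wrong"      # incorrect, or touches none of the required domains
    elif missing:
        grade = "partial"    # covers only a subset of the required domains
    else:
        grade = "correct"    # all required domains covered
    # Over-scope does not change the grade here; it is reported separately,
    # mirroring the abstract's "reduced relevance/hallucination" treatment.
    return CoverageResult(grade, missing, extra)


def mean_score(results):
    """Average partial-credit score over a non-empty list of graded answers."""
    return sum(GRADE_WEIGHTS[r.grade] for r in results) / len(results)
```

In this sketch an over-scoped answer keeps its coverage grade and is flagged through the extra field, so a downstream relevance or hallucination metric can penalize it without double-counting in the accuracy score.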

Paper Structure

This paper contains 13 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Accuracy comparison across models. Higher values indicate more correct answers.
  • Figure 2: Faithfulness comparison across models. Higher values indicate greater adherence to references.
  • Figure 3: Answer relevancy comparison across models. Higher values indicate more on-topic responses.
  • Figure 4: ANAH v2 comparison across models. Lower values indicate fewer hallucinated segments.
  • Figure 5: Hallucination comparison across models. Lower values indicate fewer hallucinations.