Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

Francesco Granata, Francesco Poggi, Misael Mongiovì

TL;DR

This work tackles factual inaccuracies in LLM-driven educational question answering by integrating a Wikidata-based Entity Linking module into a Retrieval-Augmented Generation pipeline. The proposed ELERAG architecture uses Reciprocal Rank Fusion to combine semantic dense retrieval with entity-grounded signals, and is evaluated on Italian educational data and the SQuAD-it benchmark. Results show a domain-dependent pattern: ELERAG performs best on domain-specific educational content, while a Cross-Encoder re-ranker performs best on general-domain retrieval. This points to a domain mismatch and to the value of domain-adapted hybrid approaches. The study contributes a practical, efficient framework for improving factual precision and reliability in AI-based tutoring tools, with implications for multilingual and educational AI deployments.

Abstract

In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid scheme based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm a domain mismatch effect and highlight the importance of domain adaptation and hybrid ranking strategies for factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
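As an illustration of the reciprocal rank fusion step named above, the following is a minimal sketch, not the authors' implementation. It assumes the standard RRF formulation, in which each candidate receives a score of 1 / (k + rank) from every ranking that contains it. The constant k = 60 is a commonly used default and is an assumption here, as are the chunk identifiers and the two example rankings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (60 is a common default; assumed, not taken
       from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item of each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical rankings of the same chunk pool from the two retrieval paths.
dense_ranking = ["c3", "c1", "c2", "c4"]   # semantic dense retrieval
entity_ranking = ["c1", "c4", "c3"]        # entity-based signal

fused = reciprocal_rank_fusion([dense_ranking, entity_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no calibration between the similarity scores of the dense retriever and the entity-based scores, which is one plausible reason to prefer it over a weighted sum of raw scores.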

Paper Structure

This paper contains 27 sections, 2 equations, 1 figure, 4 tables.

Figures (1)

  • Figure S1: Architectural schema of the proposed ELERAG method. The system integrates parallel retrieval paths—semantic dense retrieval and entity linking—fusing them via an RRF-based re-ranking module to ground the LLM generation.
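
The entity linking path in the figure can be pictured with a small sketch. The caption does not specify how the entity signal is scored, so the function below is an assumption: it ranks chunks by the number of Wikidata entities they share with the query. The function name, the chunk IDs, and the placeholder entity identifiers (`Q_A`, `Q_B`, `Q_C`) are all hypothetical.

```python
def rank_by_entity_overlap(query_entities, chunk_entities):
    """Rank chunks by how many linked entities they share with the query.

    query_entities: iterable of entity IDs linked in the question.
    chunk_entities: dict mapping chunk ID -> iterable of entity IDs
                    linked in that chunk.
    Returns chunk IDs with at least one shared entity, best match first.
    """
    query_set = set(query_entities)
    scored = []
    for chunk_id, entities in chunk_entities.items():
        overlap = len(query_set & set(entities))
        if overlap > 0:  # chunks with no shared entity are left unranked
            scored.append((chunk_id, overlap))
    # Higher overlap first; ties broken by chunk ID for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in scored]


# Placeholder identifiers standing in for Wikidata QIDs (hypothetical).
query_qids = {"Q_A", "Q_B"}
chunk_qids = {
    "c1": {"Q_A", "Q_B", "Q_C"},
    "c2": {"Q_C"},
    "c3": {"Q_B"},
    "c4": {"Q_A"},
}

entity_ranking = rank_by_entity_overlap(query_qids, chunk_qids)
print(entity_ranking)
```

A ranking produced this way can then be fused with the dense retrieval ranking by the RRF-based re-ranking module described in the caption.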