A Retrieval-Augmented Generation Framework for Academic Literature Navigation in Data Science
Ahmet Yasin Aytar, Kemal Kilic, Kamer Kaya
TL;DR
This work tackles the problem of efficiently navigating the expansive data science literature in the presence of LLM hallucinations by deploying an enhanced Retrieval-Augmented Generation (RAG) framework. It introduces a five-stage pipeline (GROBID-based data cleaning, domain-specific embedding fine-tuning, semantic chunking, abstract-first retrieval, and advanced prompting) and evaluates it with the RAGAS framework on a domain-spanning question set aligned with CRISP-DM. The study demonstrates that larger, domain-specific fine-tuning markedly improves Context Relevance, while semantic chunking and abstract-first retrieval further boost retrieval precision, and prompting enhances answer quality. Collectively, the approach reduces information overload and supports more decisive, literature-grounded data science workflows; the work also outlines avenues for future benchmarking and knowledge-graph integration to further enhance accuracy and reliability.
Abstract
In the rapidly evolving field of data science, efficiently navigating the expansive body of academic literature is crucial for informed decision-making and innovation. This paper presents an enhanced Retrieval-Augmented Generation (RAG) application, an artificial intelligence (AI)-based system designed to assist data scientists in accessing precise and contextually relevant academic resources. The AI-powered application integrates advanced techniques, including the GeneRation Of BIbliographic Data (GROBID) technique for extracting bibliographic information, fine-tuned embedding models, semantic chunking, and an abstract-first retrieval method, to significantly improve the relevance and accuracy of the retrieved information. This implementation of AI specifically addresses the challenge of academic literature navigation. A comprehensive evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS) framework demonstrates substantial improvements in key metrics, particularly Context Relevance, underscoring the system's effectiveness in reducing information overload and enhancing decision-making processes. Our findings highlight the potential of this enhanced Retrieval-Augmented Generation system to transform academic exploration within data science, ultimately advancing the workflow of research and innovation in the field.
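To make the abstract-first retrieval idea concrete, the following is a minimal sketch of one plausible reading of it: papers are first shortlisted by how well their abstracts match the query, and only chunks from the shortlisted papers are then ranked. The function names (`embed`, `cosine`, `abstract_first_retrieve`), the toy bag-of-words embedding, the sample data, and the `top_papers` and `top_chunks` parameters are all assumptions made for illustration; the system described here uses fine-tuned embedding models, and its actual retrieval logic may differ.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a fine-tuned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def abstract_first_retrieve(query, papers, top_papers=1, top_chunks=2):
    """Shortlist papers by abstract similarity, then rank only their chunks."""
    q = embed(query)
    # Stage 1: rank whole papers using their abstracts only.
    ranked = sorted(papers, key=lambda p: cosine(q, embed(p["abstract"])), reverse=True)
    shortlist = ranked[:top_papers]
    # Stage 2: rank chunks, restricted to the shortlisted papers.
    candidates = [(p["title"], chunk) for p in shortlist for chunk in p["chunks"]]
    candidates.sort(key=lambda tc: cosine(q, embed(tc[1])), reverse=True)
    return candidates[:top_chunks]


# Hypothetical mini-corpus, invented for this example.
PAPERS = [
    {
        "title": "Imputation Methods",
        "abstract": "methods for handling missing values in tabular data",
        "chunks": [
            "mean imputation replaces missing values with the column mean",
            "we thank the reviewers for their comments",
        ],
    },
    {
        "title": "Graph Neural Networks",
        "abstract": "a survey of graph neural networks for node classification",
        "chunks": [
            "message passing aggregates neighbor features",
            "missing values are rare in graph benchmarks",
        ],
    },
]

results = abstract_first_retrieve("how to handle missing values", PAPERS)
for title, chunk in results:
    print(f"{title}: {chunk}")
```

In this toy corpus, a chunk in the second paper also mentions missing values, but it is never considered because that paper's abstract does not match the query. This filtering effect is the intuition behind searching abstracts before chunks.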
