ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

Aarush Sinha; Viraj Virk; Dipshikha Chakraborty; P. S. Sreeja

TL;DR

ArxEval is a structured evaluation pipeline that quantifies hallucination in LLMs handling scientific literature, using ArXiv as the knowledge base. It defines two targeted tasks, Jumbled Titles and Mixed Titles, that probe retrieval fidelity and factual consistency across fifteen open-source models, using both semantic and bibliographic validation. The study finds that smaller models can outperform larger ones and that DOIs generated for mixed titles are frequently invalid or mismatched, which points to significant challenges in domain-specific faithfulness. The work offers insights for designing reliable retrieval-generation systems in scientific contexts and underscores the need for robust validation against scholarly metadata and sources.

Abstract

Language Models (LMs) now play an increasingly large role in information generation and synthesis, so the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination: generating plausible but false information, including invented citations and nonexistent research papers. Such inaccuracy is dangerous in domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating how frequently language models hallucinate when generating responses about scientific literature. We propose ArxEval, an evaluation pipeline with two tasks that use ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation covers fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.
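
The two kinds of validation mentioned above (semantic and bibliographic) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `jumble_title`, `token_cosine`, and `looks_like_doi` are hypothetical, the word-shuffling step is an assumption about how a jumbled title might be constructed, the bag-of-words cosine stands in for whatever embedding-based similarity the paper uses, and the DOI check is a syntax-only heuristic that does not confirm the DOI resolves to a real paper.

```python
import random
import re
from collections import Counter
from math import sqrt

# Syntax-only heuristic for modern DOIs; it does NOT verify that the DOI exists.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def jumble_title(title: str, seed: int = 0) -> str:
    """Shuffle the words of a title (assumed construction of a jumbled title)."""
    words = title.split()
    random.Random(seed).shuffle(words)  # seeded for reproducibility
    return " ".join(words)


def token_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def looks_like_doi(candidate: str) -> bool:
    """Return True if the string is syntactically shaped like a DOI."""
    return bool(DOI_PATTERN.match(candidate.strip()))


if __name__ == "__main__":
    title = "ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature"
    jumbled = jumble_title(title)
    print(jumbled)
    print(token_cosine(title, jumbled))
    print(looks_like_doi("10.1000/xyz123"), looks_like_doi("not-a-doi"))
```

Because shuffling leaves the set of words unchanged, the bag-of-words cosine between a title and its jumbled form stays at 1, so a real pipeline would compare the model's reconstructed title against the original. A syntactically valid DOI would also need to be looked up against scholarly metadata before it counts as found.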

Paper Structure (20 sections, 5 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Hallucinations in Large Language Models (LLMs)
Hallucinations in Domain-Specific Settings
Definition and Challenges
Causes and Perspectives
Detection and Evaluation Frameworks
Hallucinations in Multimodal Settings
Definition and Challenges
Evaluation Frameworks
Hallucinations in Natural Language Generation
Hallucinations in Academic Reference Generation
Dataset
Methodology
Jumbled Titles
...and 5 more sections

Figures (5)

Figure 1: Number of titles from each subject.
Figure 2: Our proposed pipeline for evaluating LMs on their use of ArXiv as a source for papers.
Figure 3: CosSim, BERTScore and STS Scores for all models.
Figure 4: DOIs generated by each model when prompted during the Mixed Title task.
Figure 5: DOIs found vs. DOIs not found for each model.
