SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Sebastian Haan

TL;DR

SemanticCite introduces a full-text, AI-powered citation verification framework that moves beyond abstract-level checks by combining a hybrid retrieval pipeline with a four-class taxonomy (SUPPORTED, PARTIALLY SUPPORTED, UNSUPPORTED, UNCERTAIN) and evidence-based reasoning. It demonstrates that fine-tuned lightweight models (Qwen3 variants) can achieve competitive performance with significantly reduced computation, while providing transparent explanations and ranked textual evidence. The work provides a 1,111-citation dataset across eight disciplines, open-source software, and an end-to-end pipeline including a Streamlit interface for practical deployment. The approach promises scalable, interpretable improvements in research integrity, peer review efficiency, and AI-generated content quality control, with clear paths for multilingual, multimodal, and multi-reference extensions.

Abstract

Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.
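The four-class scheme and the per-citation outputs described above (a class label, reasoning, and relevant text snippets) can be pictured as a small record type. The following is a minimal sketch, not the released SemanticCite code: the names `Verdict`, `VerificationResult`, and `needs_review`, the field layout, and the assumption that confidence lies in [0, 1] are all illustrative.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    """The four classes named in the abstract."""
    SUPPORTED = "SUPPORTED"
    PARTIALLY_SUPPORTED = "PARTIALLY SUPPORTED"
    UNSUPPORTED = "UNSUPPORTED"
    UNCERTAIN = "UNCERTAIN"


@dataclass
class VerificationResult:
    """Hypothetical container for one citation check (field names are assumptions)."""
    claim: str
    verdict: Verdict
    reasoning: str
    confidence: float  # assumed to lie in [0, 1]
    evidence: list[str] = field(default_factory=list)  # snippets, best first

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def needs_review(self) -> bool:
        """Illustrative triage rule: anything not fully supported goes to a human."""
        return self.verdict is not Verdict.SUPPORTED


result = VerificationResult(
    claim="Fine-tuned small models match larger systems.",
    verdict=Verdict.PARTIALLY_SUPPORTED,
    reasoning="The source reports comparable results on one task only.",
    confidence=0.7,
    evidence=["...comparable accuracy on the classification task..."],
)
print(result.verdict.value, result.needs_review())
```

Keeping the label as an enum rather than a free-text string makes it harder for a downstream tool to silently accept a label outside the four classes.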

Paper Structure

This paper contains 38 sections, 1 equation, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Semantic Citation Verification Pipeline: A multi-stage automated system for citation verification combining document processing (PDF extraction and text chunking), vector embedding storage, hybrid retrieval using both dense semantic similarity and sparse BM25 keyword matching, neural reranking with FlashRank, and LLM-based analysis. The pipeline outputs a classification result, supporting evidence, detailed reasoning, and confidence score for each citation verification task.
  • Figure 2: Four-Category Classification Scheme for Source-Claim Alignment Assessment
  • Figure 3: SemanticCite web interface showing citation input, reference document upload options (file upload or URL download), optional metadata entry, and configurable model parameters. The interface supports multiple LLM providers and embedding models, enabling flexible deployment across different resource constraints and institutional requirements.
  • Figure 4: Overview of the data selection and processing pipeline for semantic citation verification
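
The hybrid retrieval stage in the Figure 1 caption combines dense semantic similarity with sparse BM25 keyword matching. The sketch below illustrates one way such a combination could work, using only the standard library. Several parts are assumptions rather than details from the paper: the caption does not say how the two rankings are merged, so reciprocal rank fusion is used here as a common choice; the embedding vectors are hand-made toy values standing in for a real embedding model; and all function names are hypothetical. The FlashRank reranking step is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would use something richer."""
    return text.lower().split()


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25 (sparse keyword matching)."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(scores):
    """Map chunk index to its 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {idx: r for r, idx in enumerate(order, start=1)}


def hybrid_rank(sparse, dense, k=60):
    """Fuse two score lists with reciprocal rank fusion (an assumed fusion rule)."""
    rs, rd = ranks(sparse), ranks(dense)
    fused = {i: 1 / (k + rs[i]) + 1 / (k + rd[i]) for i in range(len(sparse))}
    return sorted(fused, key=fused.get, reverse=True)


chunks = [
    "Soil moisture was measured weekly at each site.",
    "Fine-tuned small models matched larger systems on classification accuracy.",
    "We thank the reviewers for their comments.",
]
claim = "small fine-tuned models matched larger systems"

# Toy vectors standing in for real embeddings of the chunks and the claim.
chunk_vecs = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [0.0, 0.2, 0.9]]
claim_vec = [1.0, 0.0, 0.1]

sparse = bm25_scores(claim, chunks)
dense = [cosine(claim_vec, v) for v in chunk_vecs]
order = hybrid_rank(sparse, dense)
print(order)  # chunk indices, most relevant first
```

Rank-based fusion avoids having to put BM25 scores and cosine similarities on a common scale, which is one reason it is a frequent default. A weighted sum of normalised scores would be an equally plausible reading of the caption.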