OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang

TL;DR

This work introduces OHRBench, a large, multimodal benchmark that quantifies how OCR-derived noise cascades through Retrieval-Augmented Generation systems. By collecting 1,261 PDFs and 8,561 page images across seven domains, constructing ground-truth structured data and 8,498 Q&A pairs, and applying semantic and formatting perturbations, the study reveals that current OCR solutions are insufficient for high-quality RAG knowledge bases. It further demonstrates that semantic noise consistently degrades both retrieval and generation, while formatting noise has variable effects depending on retrievers and generators, with larger LLMs showing greater robustness. The findings emphasize the need for OCR-tailored robustness and better integration between OCR outputs and RAG pipelines, and the dataset is released to drive advances in this area.

Abstract

Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various forms of OCR noise. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 8,561 carefully selected unstructured document images from seven real-world RAG application domains, along with 8,498 Q&A pairs derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each noise type. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate how RAG performance trends with the degree of OCR noise. Our OHRBench, including PDF documents, Q&As, and the ground truth structured data, is released at: https://github.com/opendatalab/OHR-Bench
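
To make the idea of perturbing structured data "with varying degrees" of noise concrete, here is a minimal Python sketch. It is not the procedure used in the paper, which the abstract does not describe. The confusion table, the choice of markup characters, and the function names `add_semantic_noise` and `add_formatting_noise` are all hypothetical, chosen only to illustrate a noise ratio between 0 and 1.

```python
import random

# Hypothetical table of visually similar characters that OCR might confuse.
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}

# Hypothetical set of markup characters treated as "formatting".
MARKUP_CHARS = set("|*#")


def _pick(candidates, ratio, seed):
    """Choose round(ratio * len(candidates)) positions reproducibly."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    k = round(ratio * len(candidates))
    return set(rng.sample(candidates, k))


def add_semantic_noise(text, ratio, seed=0):
    """Swap a fraction of confusable characters (changes the content)."""
    candidates = [i for i, ch in enumerate(text) if ch in CONFUSIONS]
    chosen = _pick(candidates, ratio, seed)
    return "".join(
        CONFUSIONS[ch] if i in chosen else ch for i, ch in enumerate(text)
    )


def add_formatting_noise(text, ratio, seed=0):
    """Drop a fraction of markup characters (content stays, structure degrades)."""
    candidates = [i for i, ch in enumerate(text) if ch in MARKUP_CHARS]
    chosen = _pick(candidates, ratio, seed)
    return "".join(ch for i, ch in enumerate(text) if i not in chosen)
```

A ratio of 0 leaves the text unchanged, and a ratio of 1 perturbs every candidate position. Fixing `seed` keeps a given noise level reproducible across runs.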

Paper Structure

This paper contains 40 sections, 18 figures, 17 tables.

Figures (18)

  • Figure 1: Our OHRBench comprises documents from 7 domains, 9 challenging attributes for OCR, 4 types of Q&A tasks, and 5 Q&A evidence sources. Each number indicates the count of PDF pages with that attribute. Criteria for these attributes can be found in the appendix on complex layouts.
  • Figure 1: Dataset Statistics
  • Figure 2: Construction of OHRBench and evaluation protocol. (1) Benchmark Dataset: documents from seven domains, human-verified ground truth structured data, and Q&As from multimodal document elements. (2) RAG Knowledge Base: Current OCR results for benchmarking and perturbed data for assessment. (3) Evaluation of OCR impact on each component and the overall RAG system.
  • Figure 3: Impact of Semantic Noise ([S] dashed lines) and Formatting Noise ([F] solid lines) on RAG components. The horizontal axis denotes the ratio $r_{\text{noise}}$, where higher values indicate greater OCR-induced noise. We report LCS and F1-score for each evidence source: text (first column), the average score for multimodal elements (tables, formulas, and charts, second column), reading order (third column), and all sources combined (last column).
  • Figure 4: Retrieval, generation, and end-to-end performance with different table formats. Only results for table-related questions are reported.
  • ...and 13 more figures
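
The Figure 3 caption reports LCS and F1-score but does not say how they are computed. The Python sketch below shows one common way to compute both over whitespace tokens. The tokenization, the lowercasing in `token_f1`, and the normalization of LCS by evidence length are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(retrieved, evidence):
    """Share of evidence tokens found, in order, in the retrieved text.

    Dividing by the evidence length is an assumption of this sketch.
    """
    ref = evidence.split()
    if not ref:
        return 0.0
    return lcs_length(retrieved.split(), ref) / len(ref)


def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `lcs_score("the cat sat on mat", "the cat sat on the mat")` matches five of the six evidence tokens in order. `token_f1("paris france", "paris")` has precision 0.5 and recall 1.0.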