KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents

Hsin-Ling Hsu, Ping-Sheng Lin, Jing-Di Lin, Jengnan Tzeng

TL;DR

The paper addresses the challenge of enabling effective retrieval from OCR outputs for Traditional Chinese non-narrative documents, where OCR noise, complex formatting, and limited synonym understanding hinder sparse and dense retrieval. It introduces Knowledge-Aware Preprocessing (KAP), a two-stage framework that first performs OCR extraction and then uses Multimodal Large Language Models to refine the text by incorporating visual cues from the original documents. KAP aims to reduce OCR noise, reconstruct document structure, and format text to meet the distinct requirements of hybrid retrieval systems. Experimental results show that KAP consistently and significantly outperforms conventional preprocessing approaches. The code is available at https://github.com/JustinHsu1019/KAP.

Abstract

Hybrid Retrieval systems, combining Sparse and Dense Retrieval methods, struggle with Traditional Chinese non-narrative documents due to their complex formatting, rich vocabulary, and the insufficient understanding of Chinese synonyms by common embedding models. Previous approaches inadequately address the dual needs of these systems, focusing mainly on general text quality improvement rather than optimizing for retrieval. We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text. KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output by integrating visual information from the original documents. This design reduces OCR noise, reconstructs structural elements, and formats the text to satisfy the distinct requirements of sparse and dense retrieval. Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches. Our code is available at https://github.com/JustinHsu1019/KAP.
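The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `preprocess_page`, `OcrFn`, `RefineFn`, and `PreprocessedPage` are hypothetical, and the OCR engine and multimodal model are replaced by stub functions so the example runs without external dependencies. The only point it carries over from the abstract is the ordering: OCR runs first, and the refinement step receives both the original image and the OCR text.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases; neither signature comes from the KAP repository.
OcrFn = Callable[[bytes], str]           # page image -> raw OCR text
RefineFn = Callable[[bytes, str], str]   # (page image, raw OCR text) -> refined text


@dataclass
class PreprocessedPage:
    raw_text: str      # stage 1 output (noisy OCR)
    refined_text: str  # stage 2 output (refined with visual context)


def preprocess_page(image: bytes, ocr: OcrFn, refine: RefineFn) -> PreprocessedPage:
    """Run OCR first, then hand both the image and the OCR text to the refiner."""
    raw = ocr(image)
    refined = refine(image, raw)
    return PreprocessedPage(raw_text=raw, refined_text=refined)


# Stubs standing in for a real OCR engine and a real multimodal model (assumptions).
def fake_ocr(image: bytes) -> str:
    # Simulated OCR noise: stray spaces and the letter 'l' misread for the digit '1'.
    return "保 險 費 率 表  l2%"


def fake_refine(image: bytes, raw: str) -> str:
    # A real refiner would consult the image; this stub only patches the known noise.
    return raw.replace(" ", "").replace("l2%", "12%")


page = preprocess_page(b"", fake_ocr, fake_refine)
print(page.refined_text)
```

Passing the two stages in as callables keeps the sketch independent of any particular OCR library or model API; swapping in real components would only require functions with matching signatures.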

Paper Structure

This paper contains 4 sections.