SDS KoPub VDR: A Benchmark Dataset for Visual Document Retrieval in Korean Public Documents

Jaehoon Lee, Sohyun Kim, Wanggeun Park, Geon Lee, Seungkyung Kim, Minyoung Lee

TL;DR

SDS KoPub VDR addresses the lack of multilingual and structurally diverse benchmarks for visual document retrieval by introducing a large-scale dataset of Korean public documents. It comprises 361 real-world documents (40,781 pages) and 600 query–page–answer triples, supports both text-only and multimodal retrieval tasks, and is built with a three-stage QA generation and validation pipeline. The work provides reproducible evaluation protocols, text-only and multimodal baseline models, and per-domain and per-query-type analyses that show substantial gains from visual information as well as remaining challenges in cross-modal reasoning. By releasing an open benchmark with governance and clear licensing, it enables rigorous assessment of multimodal RAG systems and guides future research toward robust Korean document intelligence in real-world settings.

Abstract

Existing benchmarks for visual document retrieval (VDR) largely overlook non-English languages and the structural complexity of official publications. To address this gap, we introduce SDS KoPub VDR, the first large-scale, public benchmark for retrieving and understanding Korean public documents. The benchmark is built upon 361 real-world documents, including 256 files under the KOGL Type 1 license and 105 from official legal portals, capturing complex visual elements like tables, charts, and multi-column layouts. To establish a reliable evaluation set, we constructed 600 query-page-answer triples. These were initially generated using multimodal models (e.g., GPT-4o) and subsequently underwent human verification to ensure factual accuracy and contextual relevance. The queries span six major public domains and are categorized by the reasoning modality required: text-based, visual-based, and cross-modal. We evaluate SDS KoPub VDR on two complementary tasks: (1) text-only retrieval and (2) multimodal retrieval, which leverages visual features alongside text. This dual-task evaluation reveals substantial performance gaps, particularly in multimodal scenarios requiring cross-modal reasoning, even for state-of-the-art models. As a foundational resource, SDS KoPub VDR enables rigorous and fine-grained evaluation and provides a roadmap for advancing multimodal AI in real-world document intelligence. The dataset is available at https://huggingface.co/datasets/SamsungSDS-Research/SDS-KoPub-VDR-Benchmark.
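To make the dual-task setup concrete, the sketch below scores two retrieval runs against query-to-page ground truth. It is only an illustration: the abstract does not name its metrics, so recall@k is an assumed choice here, and the function name, page identifiers, and rankings are all hypothetical toy values rather than anything taken from the dataset.

```python
from typing import Dict, List


def recall_at_k(
    ranked_pages: Dict[str, List[str]],
    gold_page: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose gold page appears among the top-k retrieved pages."""
    if not gold_page:
        return 0.0
    hits = sum(
        1
        for query_id, gold in gold_page.items()
        if gold in ranked_pages.get(query_id, [])[:k]
    )
    return hits / len(gold_page)


# Hypothetical ground truth: each query maps to the single page that answers it.
gold_page = {"q1": "doc_a:p3", "q2": "doc_b:p10", "q3": "doc_c:p1"}

# Made-up rankings standing in for a text-only retriever and a multimodal one.
text_only_run = {
    "q1": ["doc_a:p3", "doc_a:p4"],
    "q2": ["doc_b:p2", "doc_b:p10"],
    "q3": ["doc_c:p7", "doc_c:p8"],
}
multimodal_run = {
    "q1": ["doc_a:p3", "doc_a:p4"],
    "q2": ["doc_b:p10", "doc_b:p2"],
    "q3": ["doc_c:p7", "doc_c:p1"],
}

for name, run in [("text-only", text_only_run), ("multimodal", multimodal_run)]:
    scores = {k: recall_at_k(run, gold_page, k) for k in (1, 2)}
    print(name, scores)
```

Scoring both runs with the same function and the same ground truth keeps the comparison between the two tasks like-for-like; breaking the result down by domain or query type would only require filtering `gold_page` before the call.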

Paper Structure

This paper contains 51 sections, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: SDS KoPub VDR Benchmark Dataset Construction Process
  • Figure 2: Query type example
  • Figure 3: Distribution of page compositions and visual elements
  • Figure 4: Summary of text-only retrieval performances
  • Figure 5: Summary of multimodal retrieval performances
  • ...and 10 more figures