PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Lin, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen

TL;DR

This work identifies persistent perceptual and reasoning errors in large multimodal models and introduces PIN, a knowledge-intensive data format that pairs semantically rich Markdown with holistic document images to preserve fine-grained textual structures while capturing global layout. It operationalizes PIN into two large datasets, PIN-200M and PIN-14M, spanning diverse English and Chinese sources and enhanced by quality signals for targeted data filtering. The authors detail a unified data-processing pipeline, multiple subset demonstrations (including PIN-Arxiv, PIN-PMC, DocLayNet, and web/text sources), and a suite of training strategies that leverage both image-text pairs and interleaved documents, along with potential new tasks like image-from-Markdown generation. Together, PIN provides a scalable, versatile foundation for pre-training knowledge-intensive multimodal models and advancing document understanding in scientific and web domains.

Abstract

Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format uniquely combines semantically rich Markdown files, which preserve fine-grained textual structures, with holistic overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.
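The abstract's pairing of a Markdown file with an overall image, plus per-document quality signals used for filtering, can be illustrated with a minimal sketch. It assumes each JSONL line describes one document; the field names (`markdown`, `overall_image`, `quality_signals`, `doc_length`) and the file paths are hypothetical placeholders for illustration and are not taken from the released datasets.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PinEntry:
    """One paired document: a Markdown file plus its overall image."""
    markdown_path: str
    overall_image_path: str
    quality_signals: dict = field(default_factory=dict)


def parse_entries(jsonl_text):
    """Parse JSONL text (one JSON object per line) into PinEntry objects."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entries.append(PinEntry(
            markdown_path=record["markdown"],            # hypothetical key
            overall_image_path=record["overall_image"],  # hypothetical key
            quality_signals=record.get("quality_signals", {}),
        ))
    return entries


def filter_by_signal(entries, signal, minimum):
    """Keep entries whose named quality signal is at least `minimum`.

    Entries that lack the signal are dropped.
    """
    return [
        e for e in entries
        if e.quality_signals.get(signal, float("-inf")) >= minimum
    ]


# Hypothetical sample records, built in memory for the demonstration.
sample_jsonl = "\n".join(json.dumps(r) for r in [
    {"markdown": "a.md", "overall_image": "a.png",
     "quality_signals": {"doc_length": 1200}},
    {"markdown": "b.md", "overall_image": "b.png",
     "quality_signals": {"doc_length": 80}},
    {"markdown": "c.md", "overall_image": "c.png"},
])

entries = parse_entries(sample_jsonl)
kept = filter_by_signal(entries, "doc_length", 500)
print([e.markdown_path for e in kept])
```

A threshold on a single signal is only the simplest case; the same pattern extends to combining several signals when selecting data for a specific task.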

Paper Structure

This paper contains 26 sections, 10 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Comparisons of traditional multimodal formats with the proposed PIN format. The PIN format preserves rich knowledge attributes (e.g., bold text, highlighting, code blocks), supports semantic interaction between images and text within Markdown documents, and enhances knowledge representation through an overall image.
  • Figure 2: The file tree structure of an example dataset in PIN format.
  • Figure 3: An example data entry of JSONL files.
  • Figure 4: Samples from various subsets of the PIN-200M dataset. For each subset, one entry is shown with both a section of its Markdown file and the corresponding overall image.
  • Figure 5: Overview of our data processing workflow.
  • ...and 5 more figures