PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Junjie Wang, Yuxiang Zhang, Minghao Liu, Yin Zhang, Yatai Ji, Weihao Xuan, Nie Lin, Kang Zhu, Zhiqiang Lin, Yiming Ren, Chunyang Jiang, Yiyao Yu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Qunshu Lin, Yujiu Yang, Ge Zhang, Ruibin Yuan, Bei Chen, Wenhu Chen
TL;DR
This work identifies persistent perceptual and reasoning errors in large multimodal models and introduces PIN, a knowledge-intensive data format that pairs semantically rich Markdown with holistic document images to preserve fine-grained textual structures while capturing global layout. It operationalizes PIN into two large datasets, PIN-200M and PIN-14M, spanning diverse English and Chinese sources and enhanced by quality signals for targeted data filtering. The authors detail a unified data-processing pipeline, multiple subset demonstrations (including PIN-Arxiv, PIN-PMC, DocLayNet, and web/text sources), and a suite of training strategies that leverage both image-text pairs and interleaved documents, along with potential new tasks like image-from-Markdown generation. Overall, PIN provides a scalable, versatile foundation for pre-training knowledge-intensive multimodal models and advancing document understanding in scientific and web domains.
Abstract
Recent advancements in large multimodal models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. To address these issues, we introduce PIN (Paired and INterleaved multimodal documents), a novel data format designed to foster a deeper integration of visual and textual knowledge. The PIN format combines semantically rich Markdown files, which preserve fine-grained textual structures, with overall images that capture the complete document layout. Following this format, we construct and release two large-scale, open-source datasets: PIN-200M (~200 million documents) and PIN-14M (~14 million documents), compiled from diverse web and scientific sources in both English and Chinese. To maximize usability, we provide detailed statistical analyses and equip the datasets with quality signals, enabling researchers to easily filter and select data for specific tasks. Our work provides the community with a versatile data format and substantial resources, offering a foundation for new research in pre-training strategies and the development of more powerful knowledge-intensive LMMs.
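As a rough illustration of the pairing and filtering described above, the following Python sketch models one document as a Markdown string plus a path to its overall image, and selects documents by a quality signal. The class name `PinRecord`, its field names, the signal name `text_score`, and the threshold are all hypothetical and chosen only for this example; they are not the released schema of PIN-200M or PIN-14M.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class PinRecord:
    """Hypothetical in-memory view of one PIN-style document (not the official schema)."""
    markdown: str        # Markdown text preserving fine-grained textual structure
    overall_image: str   # path to the image capturing the complete document layout
    language: str = "en"
    quality_signals: Dict[str, float] = field(default_factory=dict)


def filter_records(records: Iterable[PinRecord], signal: str, minimum: float) -> List[PinRecord]:
    """Keep records whose named quality signal is present and at least `minimum`."""
    return [
        r for r in records
        if signal in r.quality_signals and r.quality_signals[signal] >= minimum
    ]


# Toy records; the signal name "text_score" and its values are made up for illustration.
docs = [
    PinRecord("# Paper A\n\nSome text.", "a.png", "en", {"text_score": 0.9}),
    PinRecord("# Paper B\n\nShort.", "b.png", "zh", {"text_score": 0.2}),
    PinRecord("# Paper C", "c.png"),  # no quality signals recorded
]
kept = filter_records(docs, "text_score", 0.5)
```

In this sketch, a record that lacks the requested signal is dropped rather than kept, which is a conservative choice when selecting data for a specific task.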
