BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan Rodriguez, Xiangru Jian, Siba Smarak Panigrahi, Tianyu Zhang, Aarash Feizi, Abhay Puri, Akshay Kalkunte, François Savard, Ahmed Masry, Shravan Nayak, Rabiul Awal, Mahsa Massoud, Amirhossein Abaskohi, Zichao Li, Suyuchen Wang, Pierre-André Noël, Mats Leon Richter, Saverio Vadacchino, Shubham Agarwal, Sanket Biswas, Sara Shanian, Ying Zhang, Noah Bolger, Kurt MacDonald, Simon Fauvel, Sathwik Tejaswi, Srinivas Sunkara, Joao Monteiro, Krishnamurthy DJ Dvijotham, Torsten Scholak, Nicolas Chapados, Sepideh Kharagani, Sean Hughes, M. Özsu, Siva Reddy, Marco Pedersoli, Yoshua Bengio, Christopher Pal, Issam Laradji, Spandana Gella, Perouz Taslakian, David Vazquez, Sai Rajeswar

TL;DR

BigDocs presents a large, license-permissive open dataset (BigDocs-7.5M) of image-text pairs for visually rich documents, designed to support continual pretraining and downstream finetuning of multimodal models on document tasks (OCR, parsing, captioning, QA) while ensuring traceable licensing and low contamination. It introduces BigDocs-Bench, a 10-task benchmark suite focusing on long-format, code-like outputs (HTML, LaTeX, SVG, JSON) and GUI reasoning, with robust automatic filtering and human-in-the-loop verification. Empirical results show models trained on BigDocs outperform baselines on general document benchmarks and excel on BigDocs-Bench tasks, including strong human-preference signals for BigDocs outputs over instruction-tuned and GPT-4o baselines. The work also provides the BigDocs Toolkit and a unified metadata framework to promote transparency, reproducibility, and responsible open research in multimodal document understanding, with the aim of empowering academics and the open-source community to advance practical document intelligence.

Abstract

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long, structured outputs can also be enhanced by multimodality. Despite this, the use of multimodal models in commercial applications is often constrained by scarce training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.

Paper Structure

This paper contains 67 sections, 22 figures, and 29 tables.

Figures (22)

  • Figure 1: BigDocs: A Large-Scale Structured Continual Pretraining and Finetuning Dataset. The inner circle represents the distribution of BigDocs, detailing the categories. The outer circle displays the specific datasets compiled to form 7 million image-text pairs. Datasets marked with * are our contributions.
  • Figure 2: BigDocs-7.5M Dataset Curation. The figure illustrates the extraction, filtering, and curation process of BigDocs-7.5M, which emphasizes maintaining permissive licensing. To build BigDocs-7.5M, we first gather publicly-available vision-language datasets, particularly those centered on document analysis, and apply a rigorous filtering process. We then augment these datasets with our own crawled data. Finally, we standardize all samples and tasks into a unified format to produce BigDocs-7.5M.
  • Figure 3: Assessing data contamination (smaller is better). The radial axis (log scale) indicates the proportion of images from the evaluation dataset whose CLIP similarity to some training sample exceeds a given threshold. Human evaluations indicate that most instances captured at a threshold of $0.98$ are problematic, and most problematic samples are identified at a threshold of $0.96$. Except for MMMU and DudeMini, BigDocs-7.5M (blue/darkred) is less contaminated than DocStruct4M (red).
  • Figure 4: Eight of the new tasks introduced in BigDocs-Bench. These tasks share a focus on understanding the underlying structure of visually rich documents, with many also requiring generating lengthy outputs, such as SVG and HTML code. More tasks are shown in Figures \ref{fig:bigdocs-ft-tasks-gui-example} and \ref{fig:bigdocs-ft-tasks-chart-to-markdown}.
  • Figure 5: Human evaluation results comparing Phi3.5 BigDocs-Bench against Phi3.5 Instruct and GPT-4o on two tasks: Table2LaTex (Left) and Screenshot2HTML (Right).
  • ...and 17 more figures
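
The contamination measure described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes image embeddings have already been computed (for example with a CLIP image encoder) and that similarity means cosine similarity; the function names `l2_normalize` and `contamination_rate` are hypothetical and are not taken from the BigDocs Toolkit. For each evaluation image, the sketch takes the highest similarity to any training image and reports the fraction of evaluation images that exceed the threshold.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def contamination_rate(eval_emb, train_emb, threshold):
    """Fraction of evaluation embeddings whose nearest training embedding
    has cosine similarity strictly above `threshold`.

    Assumption: both inputs are 2-D arrays of precomputed image embeddings
    (one row per image). This is an illustrative sketch, not the authors' code.
    """
    e = l2_normalize(np.asarray(eval_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(train_emb, dtype=np.float64))
    sims = e @ t.T                # pairwise cosine similarities
    max_sim = sims.max(axis=1)    # best training match per evaluation image
    return float(np.mean(max_sim > threshold))


if __name__ == "__main__":
    # Toy 3-D vectors standing in for real embeddings.
    train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    evals = [
        [1.0, 0.0, 0.0],    # exact duplicate of a training vector
        [0.0, 0.0, 1.0],    # unrelated
        [0.99, 0.05, 0.0],  # near duplicate
        [1.0, 1.0, 0.0],    # partially similar
    ]
    for thr in (0.96, 0.98):
        print(thr, contamination_rate(evals, train, thr))
```

For real datasets the full similarity matrix may not fit in memory, so the evaluation set would likely be processed in batches or with an approximate nearest-neighbor index; that choice is left open here.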