Docling Technical Report

Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Nikolaos Livathinos, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Fabian Lindlbauer, Kasper Dinkla, Lokesh Mishra, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar

TL;DR

Docling is an open-source, end-to-end pipeline that converts PDF documents into structured output and runs locally on commodity hardware, integrating DocLayNet for layout analysis and TableFormer for table structure recognition. It offers a modular processing pipeline with multiple PDF backends, pretrained AI models, and extensible model pipelines, and it delivers JSON or Markdown output along with rich metadata. The report demonstrates favorable performance on CPU with options for batching, highlights trade-offs with alternative backends, and positions Docling as a foundation for downstream AI tasks such as RAG and knowledge extraction, with integration into enterprise data tooling. The authors also outline future extensions and community-focused contributions under the MIT license.

Abstract

This technical report introduces Docling, an easy-to-use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware within a small resource budget. The code interface allows for easy extensibility and the addition of new features and models.

Paper Structure

This paper contains 10 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Sketch of Docling's default processing pipeline. The inner part of the model pipeline is easily customizable and extensible.
  • Figure 2: Title page of the DocLayNet paper (https://arxiv.org/pdf/2206.01062): PDF on the left, rendered Markdown on the right. If recognized, metadata such as authors appear first under the title. Text content inside figures is currently dropped; the caption is retained and linked to the figure in the JSON representation (not shown).
  • Figure 3: Page 6 of the DocLayNet paper. If recognized, metadata such as authors appear first under the title. Elements recognized as page headers or footers are suppressed in Markdown to deliver uninterrupted content in reading order. Tables are inserted in reading order. The paragraph in "5. Experiments" that wraps over the column end is broken into two and interrupted by the table.
  • Figure 4: Table 1 from the DocLayNet paper in the original PDF (A), as rendered Markdown (B), and in JSON representation (C). Spanning table cells, such as the multi-column header "triple inter-annotator mAP@0.5-0.95 (%)", are repeated for each column in the Markdown representation (B), which guarantees that every data point can be traced back to its row and column headings using only its grid coordinates in the table. In the JSON representation, the span information is reflected in the fields of each table cell (C).
  • Figure :
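
The span handling described for Figure 4 can be illustrated with a minimal sketch. It is not Docling's implementation: the `Cell` class, its field names (`row`, `col`, `row_span`, `col_span`), the helper functions, and the placeholder cell values are all assumptions made for illustration. The idea shown is only that a spanning cell's text is copied into every grid position it covers before the table is rendered as Markdown.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One table cell. Field names are illustrative, not Docling's JSON schema."""
    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1


def expand_spans(cells, n_rows, n_cols):
    """Build an n_rows x n_cols grid, copying each spanning cell's text
    into every grid position that the cell covers."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                grid[r][c] = cell.text
    return grid


def to_markdown(grid):
    """Render the grid as a pipe table, treating the first row as the header."""
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    separator = "| " + " | ".join("---" for _ in grid[0]) + " |"
    return "\n".join([lines[0], separator] + lines[1:])


# Placeholder table: one header cell spans two columns.
cells = [
    Cell("class", row=0, col=0),
    Cell("mAP (%)", row=0, col=1, col_span=2),
    Cell("row A", row=1, col=0),
    Cell("x1", row=1, col=1),
    Cell("x2", row=1, col=2),
]
grid = expand_spans(cells, n_rows=2, n_cols=3)
print(to_markdown(grid))
```

After expansion, every data cell such as `grid[1][2]` has a header directly above it at `grid[0][2]`, so the column heading can be looked up by grid coordinates alone, while the original span sizes remain available on the `Cell` objects.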