NVIDIA Nemotron Parse 1.1

Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen, Lukas Voegtle, Philipp Fischer, Marek Wawrzos, Saeid Motiian, Roman Ageev, Kedi Wu, Alexandre Milesi, Maryam Moosaei, Krzysztof Pawelec, Padmavathy Subramanian, Mehrzad Samadi, Xin Yu, Celina Dear, Sarah Stoddard, Jenna Diamond, Jesse Oliver, Leanna Chraghchian, Patrick Skelly, Tom Balough, Yao Xu, Jane Polak Scowcroft, Daniel Korzekwa, Darragh Hanley, Sandip Bhaskar, Timo Roman, Karan Sapra, Andrew Tao, Bryan Catanzaro

TL;DR

Nemotron-Parse-1.1 is an end-to-end vision-language model for document-level OCR that jointly outputs formatted text, bounding boxes, and semantic labels while preserving reading order. It pairs a RADIO-based ViT-H/16 encoder with a compact 10-layer decoder (885M parameters in total), uses NoPE to enable long-context inference, and applies a multi-token inference scheme to accelerate decoding; a faster Nemotron-Parse-1.1-TC variant uses pixel-shuffle to shorten the vision sequence length. Trained on a diverse data blend (NVpdftex, DocLayNet, Common Crawl, synthetic tables, multilingual OCR corpora, and TabRecSet), it introduces a fixed prompt interface and a maximal-information prompt to unify supervision across datasets. Empirical results show competitive OCR accuracy, robust reading order, strong table extraction, and solid multilingual performance; publicly released weights and an optimized NIM container support scalable deployment.

Abstract

We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as their corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks, making it a strong lightweight OCR solution. We release the model weights publicly on Hugging Face, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC, which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.
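
The reduced vision token length of the TC variant is attributed in the TL;DR to pixel-shuffle. As a rough illustration of how such a step can shorten a sequence, the sketch below merges each block of neighbouring vision tokens into a single wider token (the space-to-depth form of pixel-shuffle). It is a minimal sketch under stated assumptions, not the model's implementation: the function names, the nested-list token layout, and the merge factor of 2 are all hypothetical, since the text above does not specify them.

```python
from typing import List

Token = List[float]        # one vision token = a feature vector
Grid = List[List[Token]]   # tokens laid out as rows x cols


def pixel_shuffle_tokens(grid: Grid, factor: int = 2) -> Grid:
    """Merge each factor x factor block of neighbouring tokens into one token.

    The token count drops by factor**2 while the channel width grows by
    factor**2, so no feature values are discarded. The default factor of 2
    is an illustrative assumption, not a value taken from the paper.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be divisible by factor")
    out: Grid = []
    for r in range(0, rows, factor):
        out_row: List[Token] = []
        for c in range(0, cols, factor):
            merged: Token = []
            # Concatenate the block's tokens in row-major order.
            for dr in range(factor):
                for dc in range(factor):
                    merged.extend(grid[r + dr][c + dc])
            out_row.append(merged)
        out.append(out_row)
    return out


def sequence_length(grid: Grid) -> int:
    """Number of tokens a decoder would attend over for this grid."""
    return sum(len(row) for row in grid)


if __name__ == "__main__":
    # Toy 4 x 4 grid of 3-dimensional tokens.
    toy = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
    merged = pixel_shuffle_tokens(toy, factor=2)
    print(sequence_length(toy), "->", sequence_length(merged))
```

With a factor of 2, a 4 x 4 grid of 3-dimensional tokens becomes a 2 x 2 grid of 12-dimensional tokens, so the sequence the decoder attends over is four times shorter while every input feature is retained.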

Paper Structure

This paper contains 25 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Layout analysis: bounding box detection and prediction of semantic classes.
  • Figure 2: OCR, extraction of text formatting and mathematical equations in LaTeX and markdown.
  • Figure 3: Extraction of complex tables to LaTeX format.