Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen

Abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.

Paper Structure (17 sections, 5 figures, 8 tables)

This paper contains 17 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Performance on OmniDocBench v1.5 across pipeline (left) and end-to-end (right) models. Qianfan-OCR (red) achieves 93.12, ranking first among all end-to-end models. The red dashed line indicates Qianfan-OCR's score for cross-category comparison.
  • Figure 2: Performance on general OCR and document understanding benchmarks. Top row: OlmOCR Bench, OCRBench, OCRBenchv2 (en), and OCRBenchv2 (zh). Bottom row: CCOCR-multilan, and document understanding tasks (DocVQA, TextVQA, OCRVQA) where two-stage OCR+LLM pipelines (hatched bars) show significant degradation compared to end-to-end models.
  • Figure 3: Architectural comparison between traditional two-stage OCR pipeline and Qianfan-OCR's end-to-end approach. (a) Traditional pipeline systems separate layout analysis and content recognition into independent stages, suffering from error propagation and irreversible loss of visual context. (b) Qianfan-OCR unifies all processing into a single vision-language model, accepting custom prompts for flexible task control and optionally generating intermediate layout reasoning via Layout-as-Thought ($\langle$think$\rangle$ tokens).
  • Figure 4: Cumulative OmniDocBench v1.5 score with samples sorted by layout label entropy (descending). In the high-entropy region (left), enabling thinking provides a stable advantage. As lower-entropy samples are included, the gap narrows and eventually reverses, with the no-think mode achieving a higher total score overall.
  • Figure 5: Layout-as-Thought example on a math exam paper. Left: Original document image. Right: Visualization of bounding boxes generated during the thinking phase, with element types color-coded (e.g., text, image, paragraph_title, vision_footnote).
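
The analysis described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes "layout label entropy" means the Shannon entropy of the element-type labels on a page and that the cumulative score is a running sum of per-sample scores. The function names, the sample dictionary fields (`labels`, `think`, `no_think`), and the toy scores are all hypothetical and chosen only to reproduce the qualitative shape of the curve.

```python
import math
from collections import Counter
from itertools import accumulate


def label_entropy(labels):
    """Shannon entropy (in bits) of the layout-label distribution of one page."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cumulative_scores(samples, key):
    """Sort samples by label entropy (descending) and return the running sum of `key`."""
    ordered = sorted(samples, key=lambda s: label_entropy(s["labels"]), reverse=True)
    return list(accumulate(s[key] for s in ordered))


# Toy data invented for illustration (scores on a 0-100 scale); not from the paper.
samples = [
    {"labels": ["text", "text", "text", "text"], "think": 85, "no_think": 97},
    {"labels": ["text", "table", "image", "formula"], "think": 80, "no_think": 70},
    {"labels": ["text", "text", "table", "table"], "think": 85, "no_think": 85},
]

think_curve = cumulative_scores(samples, "think")
no_think_curve = cumulative_scores(samples, "no_think")

# Positive gap: thinking is ahead so far; negative gap: no-think is ahead.
gap = [t - n for t, n in zip(think_curve, no_think_curve)]
print(gap)
```

In this toy setup the gap starts positive on the high-entropy page and turns negative once the single-label page is included, which mirrors the narrowing and eventual reversal that the caption describes.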