DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

TL;DR

DianJin-OCR-R1 is a reasoning-enhanced recognition framework that trains VLMs in a reasoning-and-tool interleaved paradigm. It consistently outperforms both its non-reasoning counterparts and expert models.

Abstract

Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generating words that do not exist in the input image due to over-reliance on language priors. By contrast, traditional OCR models, whose architectures are tailored for specific recognition tasks, often achieve stronger fine-grained visual perception with fewer hallucinations, but they typically lack the contextual semantic understanding and reasoning capabilities needed in more challenging cases. To bridge this gap, we propose DianJin-OCR-R1, a reasoning-enhanced framework for recognition that trains VLMs in a reasoning-and-tool interleaved paradigm. Our DianJin-OCR-R1 model first recognizes the content in the input image through its own OCR capabilities, and then calls other expert models for extra results as references. After that, it is guided to "look again" at the image and compare its own recognized content with other results to find errors or omissions. Finally, it integrates all available evidence to generate a more accurate output. This design empowers the model to learn how to implicitly re-focus on the visual input and effectively leverage the results of other expert models for better performance. We evaluate our DianJin-OCR-R1 model on ReST and OmniDocBench, where it consistently outperforms both its non-reasoning counterparts and expert models, demonstrating the effectiveness of our method.
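
The control flow described in the abstract (recognize, call expert tools, look again, integrate) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: in the paper a single VLM produces these steps inside one interleaved reasoning chain, whereas here each step is a separate hypothetical callable. The names `run_interleaved`, `Trace`, and `majority_vote` are assumptions introduced for this example, and a toy majority vote stands in for the model's actual look-again reasoning.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Trace:
    """Record of one pass through the (hypothetical) interleaved pipeline."""
    own_result: str
    tool_results: Dict[str, str]
    final_result: str


def run_interleaved(
    image: Any,
    recognize: Callable[[Any], str],
    expert_tools: Dict[str, Callable[[Any], str]],
    look_again: Callable[[Any, str, Dict[str, str]], str],
) -> Trace:
    # Step 1: the model recognizes the content with its own OCR ability.
    own = recognize(image)
    # Step 2: expert models are called for extra results used as references.
    refs = {name: tool(image) for name, tool in expert_tools.items()}
    # Steps 3-4: re-inspect the image, compare the candidates, and integrate
    # the evidence into a final answer.
    final = look_again(image, own, refs)
    return Trace(own, refs, final)


def majority_vote(image: Any, own: str, refs: Dict[str, str]) -> str:
    # Toy stand-in (assumption): pick the most common candidate string and
    # fall back to the model's own reading when no two candidates agree.
    # The real model reasons over the image; this function ignores it.
    counts = Counter([own, *refs.values()])
    best, n = counts.most_common(1)[0]
    return best if n > 1 else own


if __name__ == "__main__":
    trace = run_interleaved(
        image=None,  # placeholder; no real image is needed for the toy stubs
        recognize=lambda img: "ACME C0RP",
        expert_tools={
            "tool_a": lambda img: "ACME CORP",
            "tool_b": lambda img: "ACME CORP",
        },
        look_again=majority_vote,
    )
    print(trace.final_result)
```

Keeping the model's own reading and the tool outputs as separate fields in the trace mirrors the idea that the final answer is produced by comparing candidates rather than by trusting any single source.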

Paper Structure

This paper contains 24 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Visualization of the reasoning pipeline of our proposed method.
  • Figure 2: Overall pipeline for constructing reasoning data.
  • Figure 3: Visualization of look-again. The red curve denotes the attention between each token and the image tokens, whereas the black curve denotes the attention between each token and the other tokens.
  • Figure 4: Prompt used to construct reasoning data for seal recognition.
  • Figure 5: Prompt used to construct reasoning data for table recognition.
  • ...and 1 more figures