OCR-Agent: Agentic OCR with Capability and Memory Reflection

Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai

TL;DR

This paper proposes a novel iterative self-correction framework that endows models with two key capabilities, Capability Reflection and Memory Reflection, and demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training.

Abstract

Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English subset and +1.2 on the Chinese subset, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
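
The three-stage loop described in the abstract (diagnose and plan, check against memory, re-reason) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine`, `capability_filter`, `Memory`, and `INFEASIBLE_HINTS` are hypothetical, the keyword-based feasibility check stands in for whatever the model itself does during Capability Reflection, the stopping rule (halt when no new feasible plan remains) is assumed, and the two stub functions replace real VLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical list of actions a frozen VLM cannot perform on its input.
INFEASIBLE_HINTS = ("enhance image", "zoom in", "increase resolution")


@dataclass
class Memory:
    """Keeps past correction plans so later rounds can avoid repeating them."""
    plans: List[str] = field(default_factory=list)

    def is_repeat(self, plan: str) -> bool:
        key = plan.strip().lower()
        return any(key == p.strip().lower() for p in self.plans)


def capability_filter(steps: List[str]) -> List[str]:
    """Capability Reflection (assumed form): drop steps the model cannot execute."""
    return [s for s in steps
            if not any(h in s.lower() for h in INFEASIBLE_HINTS)]


def refine(question: str,
           answer_fn: Callable[[str, List[str]], str],
           reflect_fn: Callable[[str, str], List[str]],
           max_rounds: int = 3) -> Tuple[str, Memory]:
    """Iteratively refine an answer without any training."""
    memory = Memory()
    answer = answer_fn(question, [])              # initial attempt
    for _ in range(max_rounds):
        # 1) Diagnose errors and propose a plan, keeping only feasible steps.
        steps = capability_filter(reflect_fn(question, answer))
        # 2) Memory Reflection: discard plans that were already tried.
        steps = [s for s in steps if not memory.is_repeat(s)]
        if not steps:                             # assumed stopping rule
            break
        memory.plans.extend(steps)
        # 3) Re-reason with the accumulated guidance.
        answer = answer_fn(question, memory.plans)
    return answer, memory


# Stub "model" calls for demonstration only; a real system would query a VLM.
def stub_reflect(question: str, answer: str) -> List[str]:
    return ["Enhance image contrast", "Re-read the second digit"]


def stub_answer(question: str, guidance: List[str]) -> str:
    return "71" if "Re-read the second digit" in guidance else "17"


final_answer, final_memory = refine("What number is printed?", stub_answer, stub_reflect)
print(final_answer, final_memory.plans)
```

With these stubs, the infeasible "Enhance image contrast" step is filtered out in the first round, the remaining step changes the answer, and the second round stops because the only feasible plan has already been tried.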

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, 4 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of OCR-Agent. The model iteratively refines its answer by (1) Capability Reflection to filter infeasible actions (e.g., "enhance image"), and (2) Memory Reflection to avoid repeating past mistakes, enabling stable, training-free self-correction.
  • Figure 2: Overview of OCR-Agent.
  • Figure 3: A visual comparison of results from the Naive, CoT, and our proposed methods. The examples in the "Ours" column represent the final output after the last round of iteration.
  • Figure 4: Performance of OCR-Agent in understanding and reasoning as the number of trials increases. (a) and (c) Understanding scores in English and Chinese: how the understanding capabilities of CoT only, Self-Refine, and OCR-Agent evolve with increasing trial numbers. (b) and (d) Reasoning scores in English and Chinese: how the reasoning ability of the same methods changes as trials proceed.