Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Xu Guo, Pei Chu, Chenhui Li, Ru Zhang, Wenhai Wang, Gongshen Liu

TL;DR

Consensus Entropy (CE) introduces a training-free, uncertainty-aware OCR framework that leverages multi-model agreement to automatically verify and improve OCR outputs. By modeling inter-model output convergence for correct predictions and divergence for errors, CE derives a global uncertainty score $\delta$ to guide ensemble fusion or routing to a stronger model via a threshold $\theta$. The approach yields state-of-the-art results on OCR benchmarks, enables high-quality output selection, and reduces computation by routing only a small fraction of inputs to expensive models. Its training-free, plug-and-play design makes CE practical for data filtering, quality control, and self-improving OCR pipelines in real-world multimodal systems.
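The routing idea in the TL;DR can be illustrated with a minimal sketch. This excerpt does not give the paper's exact definition of $\delta$, so the code below assumes a stand-in: the mean pairwise string distance between model outputs, computed with Python's `difflib`. The function names (`consensus_score`, `select_or_route`), the medoid selection used in place of ensemble fusion, and the example threshold are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Sequence


def pairwise_distance(a: str, b: str) -> float:
    """Return 1 - similarity ratio: 0.0 for identical strings, 1.0 for no overlap."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def consensus_score(outputs: Sequence[str]) -> float:
    """Assumed stand-in for delta: mean pairwise distance across model outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two model outputs")
    return sum(pairwise_distance(a, b) for a, b in pairs) / len(pairs)


def select_or_route(
    outputs: Sequence[str],
    theta: float,
    strong_model: Callable[[], str],
) -> tuple[str, float]:
    """Keep an ensemble output when models agree; otherwise call a stronger model.

    theta is the routing threshold; strong_model is a hypothetical callable
    that re-runs OCR on the same input with a more expensive model.
    """
    delta = consensus_score(outputs)
    if delta > theta:
        # High disagreement: treat the sample as unreliable and route it.
        return strong_model(), delta
    # Low disagreement: pick the output closest to all the others (the medoid).
    best = min(
        outputs,
        key=lambda o: sum(pairwise_distance(o, other) for other in outputs),
    )
    return best, delta


if __name__ == "__main__":
    agreeing = ["hello world", "hello world", "hello wor1d"]
    text, delta = select_or_route(agreeing, theta=0.5, strong_model=lambda: "strong")
    print(text, round(delta, 3))
```

Only samples whose score exceeds `theta` trigger the expensive call, which is the mechanism by which the described approach limits how many inputs reach the stronger model.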

Abstract

The Optical Character Recognition (OCR) task is important both for evaluating Vision-Language Models (VLMs) and for providing high-quality data for LLM training. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs, and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution achieves 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivers 6.0% accuracy gains on mathematical calculation tasks, and requires rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision and remains plug-and-play throughout.

Paper Structure

This paper contains 33 sections, 9 equations, 13 figures, 9 tables, 3 algorithms.

Figures (13)

  • Figure 1: Prediction behaviors across entropy levels. Each plot visualizes VLM predictions in a 2D space. In low-entropy cases (a), predictions tightly cluster around the ground truth (green), while in medium (b) and high-entropy (c) settings, predictions increasingly diverge.
  • Figure 2: Normalized entropy (0-1) analysis of four combination methods across different distribution types and grid resolutions. Single point distribution (dashed line) and uniform distribution (dotted line) serve as lower and upper bounds. Error bars represent standard deviation over five runs.
  • Figure 3: Model Performance on OCRBench under Different CE Thresholds. CE values are computed with two reference models (ref1: Qwen2-VL-7B, ref2: Qwen2-VL-72B). The 210models_avg curve, representing the average performance of 210 models, demonstrates the wide applicability of CE in filtering correct outputs from VLMs. The shaded area is formed by the accuracy curves of ref1 and ref2, while the solid line represents their average.
  • Figure 4: OCR Performance Comparison. Cumulative scores of different models plotted against token length, with Self-Consistency (SC@3) models in blue, Routing models in red-orange, and Single models in gray. Top performers in each category shown with thicker lines.
  • Figure 5: Case 1: High-quality Chinese OCR sample with strong agreement between human evaluation (4/4), CE score (0.9891), and VLM-as-Judge score (0.9).
  • ...and 8 more figures