From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models

Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Chen Dai

Abstract

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.
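The accept/abstain rule described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `grc_decide`, exact-string majority voting as the consensus step, normalized edit distance as the stability measure, the `max_dispersion` threshold, and the length-based `is_valid` screen are all hypothetical. The parameter `m` is interpreted as the minimum number of agreeing views.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def grc_decide(
    views: Sequence[str],
    m: int = 3,                                   # assumed: minimum agreeing views
    max_dispersion: float = 0.34,                 # hypothetical stability threshold
    is_valid: Callable[[str], bool] = lambda s: 0 < len(s) <= 32,  # hypothetical screen
) -> Optional[str]:
    """Return the consensus transcription, or None to abstain."""
    # Structural screening: drop views that fail the lightweight validity check.
    valid = [s for s in views if is_valid(s)]
    if len(valid) < m:
        return None

    # Cross-view consensus: the most frequent valid string needs at least m votes.
    candidate, votes = Counter(valid).most_common(1)[0]
    if votes < m:
        return None

    # Stability: abstain when any valid view lies far from the candidate.
    dispersion = max(
        edit_distance(candidate, s) / max(len(candidate), len(s), 1)
        for s in valid
    )
    if dispersion > max_dispersion:
        return None
    return candidate


if __name__ == "__main__":
    print(grc_decide(["hello", "hello", "hello", "hallo", "hello!"]))
    print(grc_decide(["hello", "hello", "hello", "hello world again", "x"]))
```

Under these assumptions, the first call accepts the majority string because the two dissenting views stay close to it, while the second abstains even though three of five views agree, since one valid view diverges strongly from the candidate.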

Paper Structure

This paper contains 20 sections, 21 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: System overview of the proposed Geometric Risk Controller (GRC). Given multi-view queries to a frozen vision-language model, the controller applies structural screening and cross-view consensus to produce a candidate string $s^*$, the consensus transcription across views. The operating point then determines whether $s^*$ is accepted or the controller abstains.
  • Figure 2: Representative accept-abstain evidence patterns under the fixed $K=5$ protocol at the default operating point ($m=3$). Fragmented valid outputs cause abstention, a majority can still be rejected under high dispersion, and borderline consensus is accepted when the valid views remain tightly clustered.
  • Figure 3: Qualitative OCR cases under the fixed deployment protocol. Each panel shows the crop, ground truth, the always-accept baseline, and the GRC decision at the default operating point ($m=3$). Cases include over-generation, unsupported substitution, correct acceptance, cross-view instability, a stable-but-wrong residual failure, and a borderline accepted case. The evidence tag reports valid-view count, consensus fraction $q$, and geometric length estimate $L_{\mathrm{geom}}$; for the instability case, it reports dispersion $\Delta$ instead of $L_{\mathrm{geom}}$.
  • Figure 4: Risk-coverage trajectories under the common fixed protocol for all three backbones on IIIT5K and ICDAR13. Each point is a predefined operating point indexed by $m$; larger $m$ yields lower coverage and lower covered-output risk. Filled markers denote the default operating point $m=3$.
  • Figure 5: Component effects and comparison to an external confidence-threshold baseline on LLaVA-Phi3 under the fixed protocol ($K=5$, $m=3$). Bars show Meltdown@2; labels show coverage. Full GRC achieves the lowest catastrophic exposure on both datasets while maintaining competitive coverage.
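The risk and coverage quantities referenced in the figure captions can be made concrete with a small sketch. It assumes that coverage is the fraction of inputs on which the controller accepts, and it uses the exact-match error rate among accepted outputs as a stand-in for covered-output risk. The paper's own metrics (for example Meltdown@2) are not defined in this excerpt and are not reproduced here; the function name `risk_coverage` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def risk_coverage(
    decisions: Sequence[Optional[str]],
    truths: Sequence[str],
) -> Tuple[float, float]:
    """Return (coverage, risk) for one operating point.

    decisions[i] is the accepted transcription, or None when the controller abstains.
    Risk is the exact-match error rate over accepted outputs (an assumed stand-in).
    """
    if len(decisions) != len(truths):
        raise ValueError("decisions and truths must have the same length")

    # Keep only the samples on which the controller committed to an output.
    accepted = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    coverage = len(accepted) / len(truths) if truths else 0.0
    if not accepted:
        return coverage, 0.0  # no covered outputs, so no covered-output errors

    errors = sum(d != t for d, t in accepted)
    return coverage, errors / len(accepted)


if __name__ == "__main__":
    decisions = ["cat", None, "dog", "bird"]
    truths = ["cat", "car", "dig", "bird"]
    print(risk_coverage(decisions, truths))
```

Evaluating this function once per operating point $m$ would give one (coverage, risk) pair per point, which is the kind of trajectory Figure 4 plots.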