
DODO: Discrete OCR Diffusion Models

Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman

TL;DR

DODO is the first VLM to use block discrete diffusion for OCR and to unlock its speedup potential. It achieves near state-of-the-art accuracy while enabling up to 3x faster inference than autoregressive baselines.

Abstract

Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents because it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
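
The block-wise parallel decoding described in the abstract can be illustrated with a small toy. The sketch below is not the paper's implementation: the function names, the confidence threshold, and the toy predictor (which always returns the correct token and treats alphanumeric characters as "visually clear") are all assumptions made for illustration. It only shows the control flow: blocks are processed left to right, and within a block every masked position whose confidence clears the threshold is committed in the same forward pass.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet
Prediction = Tuple[str, float]  # (token, confidence)


def block_diffusion_decode(
    predict: Callable[[List[Optional[str]]], List[Prediction]],
    length: int,
    block_size: int,
    threshold: float = 0.9,  # hypothetical confidence cutoff
) -> Tuple[List[Optional[str]], int]:
    """Decode `length` tokens block by block; return (tokens, forward_passes)."""
    seq: List[Optional[str]] = [MASK] * length
    passes = 0
    for start in range(0, length, block_size):
        end = min(start + block_size, length)
        # Keep denoising the current block until no masked position remains.
        while any(seq[i] is MASK for i in range(start, end)):
            preds = predict(seq)  # one forward pass over the current sequence
            passes += 1
            masked = [i for i in range(start, end) if seq[i] is MASK]
            # Commit, in parallel, every masked position above the threshold.
            chosen = [i for i in masked if preds[i][1] >= threshold]
            # Always commit at least the most confident one to guarantee progress.
            if not chosen:
                chosen = [max(masked, key=lambda i: preds[i][1])]
            for i in chosen:
                seq[i] = preds[i][0]
    return seq, passes


def make_toy_predictor(target: List[str]):
    """Toy stand-in for a model: always predicts the right token, but is only
    confident about non-alphanumeric tokens once their left neighbor is known."""
    def predict(seq: List[Optional[str]]) -> List[Prediction]:
        preds = []
        for i, tok in enumerate(target):
            left_known = i == 0 or seq[i - 1] is not MASK
            conf = 0.99 if (tok.isalnum() or left_known) else 0.5
            preds.append((tok, conf))
        return preds
    return predict


target = list("Hello, world!")
tokens, passes = block_diffusion_decode(make_toy_predictor(target), len(target), block_size=8)
assert tokens == target
assert passes < len(target)  # fewer forward passes than one-token-per-pass decoding
```

Because the toy predictor never makes mistakes, it does not reproduce the synchronization errors the abstract attributes to global diffusion; it only shows how confident tokens can be resolved together while less certain ones are deferred to later passes.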

Paper Structure (44 sections, 5 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: DODO: High-throughput parallel generation. Unlike autoregressive models constrained to a strict left-to-right sequence, DODO generates text across the entire canvas simultaneously, based on visual confidence (tokens decoded in the same step share a color). In this example, it resolves $148$ tokens in just $15$ forward passes ($\approx 10$ tokens/step on average). Notably, large, distinct regions appear early, while ambiguous high-frequency tokens (e.g., punctuation) are deferred to later steps.
  • Figure 2: Semantically flexible vs. semantically rigid vision–language tasks. Left: Image captioning admits multiple, semantically equivalent descriptions of the same image. Different decoding trajectories can converge to distinct but equally valid captions, and lexical or structural variations are naturally absorbed. Right: OCR requires a single, exact transcription determined by the image. Even minimal local deviations, such as an incorrect token choice or boundary, render the output incorrect. As a result, conditioned on the image, OCR exhibits extremely low output variability, which makes it a natural candidate for parallel decoding, but also a demanding setting in which errors cannot be compensated by alternative phrasings or later corrections.
  • Figure 3: Conditional independence assumption. Parallel decoding assumes that masked tokens can be predicted independently given the context. (Top) In open-ended tasks, ambiguity between valid options (e.g., "Eiffel Tower" vs. "Great Wall") risks sampling incoherent mixtures like "Eiffel Wall." (Bottom) In deterministic regimes like OCR, the strong visual signal resolves this ambiguity, enabling conflict-free parallel decoding.
  • Figure 4: Full vs. block diffusion. In standard full diffusion (left), MDM sampling is applied globally to the entire sequence. In contrast, block diffusion (right) restricts parallel sampling to discrete windows, processing blocks sequentially from left to right.
  • Figure 5: Inference throughput comparison. While standard DODO matches the speed of the autoregressive Qwen 2.5 VL baseline ($\approx 21$ tokens/sec), the DODO fast variant leverages block-causal attention and KV-caching to triple the throughput to $\approx 63$ tokens/sec, establishing a new efficiency standard for diffusion-based VLMs.
  • ...and 5 more figures
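
The conditional independence assumption in the Figure 3 caption, and the ratios quoted in the Figure 1 and Figure 5 captions, can be written out explicitly. The notation here is an assumption and is not taken from the paper: $v$ is the image, $M$ the set of masked positions, and $x_{\bar{M}}$ the already decoded tokens. A parallel decoding step samples each masked token from its own marginal instead of from the joint distribution:

```latex
\[
p_\theta\!\left(x_M \mid x_{\bar{M}}, v\right)
\;\approx\;
\prod_{i \in M} p_\theta\!\left(x_i \mid x_{\bar{M}}, v\right)
\]

% Ratios implied by the quoted figures:
\[
\frac{148 \text{ tokens}}{15 \text{ forward passes}} \approx 9.9 \text{ tokens/step},
\qquad
\frac{63 \text{ tokens/sec}}{21 \text{ tokens/sec}} = 3
\]
```

The approximation becomes exact when every marginal is a point mass, that is, when the image leaves only one plausible token per position. This is the OCR regime in the bottom panel of Figure 3. In the open-ended case, each marginal can spread its mass over tokens from different valid outputs, so independent samples may combine into a sequence such as "Eiffel Wall" that the joint distribution would never produce.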