Table of Contents
Fetching ...

Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts

Samandar Samandarov, Nazirjon Ismoiljonov, Abdullah Sattorov, Temirlan Sabyrbayev

TL;DR

The Whisperer is introduced, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively whispering to the frozen OCR through its inputs, to improve an imperfect classifier without touching its weights.

Abstract

In the landscape of modern machine learning, frozen pre-trained models provide stability and efficiency but often underperform on specific tasks due to mismatched data distributions. This paper introduces the Whisperer, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively "whispering" enhancements to frozen downstream models like EasyOCR. By framing the process as behavioral cloning of stochastically discovered improvement policies, our method achieves an 8% absolute (10.6% relative) reduction in Character Error Rate (CER) on a challenging dataset of 300k degraded synthetic text images, surpassing hand-engineered baselines such as CLAHE. The key innovation is a four-stage training curriculum that uses behavioral cloning to amplify "lucky" improvements discovered through the stochastic exploration of a partially trained diffusion model. This approach is highly sample-efficient and avoids the pitfalls of traditional reinforcement learning. Crucially, we frame this not as naive reinforcement learning, but as behavioral cloning of an exploration policy: we stochastically sample intermediate diffusion outputs, select those that improve CER by chance, and then train the model to reproduce them. This bootstrapping curriculum (4 stages over 60 GPU-hours) amplifies random successes into a systematic strategy. In summary, by whispering to the frozen OCR through its inputs, we improve an imperfect classifier without touching its weights.

Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts

TL;DR

The Whisperer is introduced, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively whispering to the frozen OCR through its inputs, to improve an imperfect classifier without touching its weights.

Abstract

In the landscape of modern machine learning, frozen pre-trained models provide stability and efficiency but often underperform on specific tasks due to mismatched data distributions. This paper introduces the Whisperer, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively "whispering" enhancements to frozen downstream models like EasyOCR. By framing the process as behavioral cloning of stochastically discovered improvement policies, our method achieves an 8% absolute (10.6% relative) reduction in Character Error Rate (CER) on a challenging dataset of 300k degraded synthetic text images, surpassing hand-engineered baselines such as CLAHE. The key innovation is a four-stage training curriculum that uses behavioral cloning to amplify "lucky" improvements discovered through the stochastic exploration of a partially trained diffusion model. This approach is highly sample-efficient and avoids the pitfalls of traditional reinforcement learning. Crucially, we frame this not as naive reinforcement learning, but as behavioral cloning of an exploration policy: we stochastically sample intermediate diffusion outputs, select those that improve CER by chance, and then train the model to reproduce them. This bootstrapping curriculum (4 stages over 60 GPU-hours) amplifies random successes into a systematic strategy. In summary, by whispering to the frozen OCR through its inputs, we improve an imperfect classifier without touching its weights.
Paper Structure (23 sections, 4 equations, 3 figures, 2 tables)

This paper contains 23 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Inference-time architecture of the Whisperer: a frozen perceptual encoder conditions a U-Net that produces clamped pixel-space updates applied iteratively to the degraded input.
  • Figure 2: Training curriculum (four-stage pipeline) used to bootstrap the diffusion-based visual prompt for a frozen OCR model.
  • Figure 3: Qualitative comparison of classical preprocessing filters. Each panel shows the original input and the corresponding visual outputs after applying different filters (e.g., CLAHE, unsharp masking, and gamma correction).