NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement

Giordano Cicchetti, Danilo Comminiello

TL;DR

NAF-DPM is a fast, nonlinear activation-free diffusion framework for document enhancement that targets deblurring and binarization. It combines a lightweight NAFNet-based initial predictor with a conditional diffusion model that refines residual high-frequency details, and it uses an ODE-based solver (DPM-Solver) for rapid sampling. An OCR-guided differentiable finetuning module (a CRNN trained with CTC loss) further improves character fidelity and reduces OCR errors. Across the DeblurringDataset and DIBCO benchmarks, NAF-DPM achieves state-of-the-art or competitive results on pixel-level and perceptual metrics while substantially reducing character errors in OCR outputs, which makes it practical for real-world document preprocessing.

Abstract

Real-world documents may suffer various forms of degradation, often resulting in lower accuracy in optical character recognition (OCR) systems. A preprocessing step is therefore essential to eliminate noise while preserving text and key features of documents. In this paper, we propose NAF-DPM, a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents. While DPMs are recognized for their high-quality generated images, they are also known for their long inference time. To mitigate this problem, we equip the DPM with an efficient nonlinear activation-free (NAF) network and employ a fast ordinary differential equation solver as the sampler, which can converge in a few iterations. To better preserve text characters, we introduce an additional differentiable module based on convolutional recurrent neural networks, simulating the behavior of an OCR system during training. Experiments conducted on various datasets showcase the superiority of our approach, which achieves state-of-the-art performance in terms of pixel-level and perceptual similarity metrics. Furthermore, the results demonstrate a notable reduction in the character errors made by OCR systems when transcribing real-world document images enhanced by our framework. Code and pre-trained models are available at https://github.com/ispamm/NAF-DPM.
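
The character error reduction mentioned in the abstract is usually quantified with a character error rate (CER). The sketch below is a minimal illustration and not the authors' evaluation code. It assumes CER is defined as the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. The function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR output where a blurred "m" was read as "rn".
print(character_error_rate("document", "docurnent"))
```

Under this definition a lower value is better, and the value can exceed 1 when the OCR output contains many spurious insertions.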

Paper Structure (18 sections, 18 equations, 9 figures, 6 tables, 2 algorithms)

Figures (9)

  • Figure 1: Random samples of image patches extracted from the Document Deblurring dataset, DIBCO2017 and H-DIBCO2018 [DeblurringDataset, DIBCO2017, DIBCO2018]. The first column shows degraded images; the second shows their counterparts restored by our framework. The enhanced images are of high quality and very similar to the originals. Furthermore, the edges of the text appear sharper and better defined, which helps an OCR system recognize the text more accurately.
  • Figure 2: Examples of document images used in this work. Images in subfigure (a) come from the document deblurring OCR test dataset [DeblurringDataset], and their associated task is image deblurring. Images in subfigures (b), (c), (d) come from the annual Document Image Binarization Competition (DIBCO) [DIBCO2017, DIBCO2019, DIBCO2018], and their associated task is image binarization.
  • Figure 3: NAF-DPM architecture. An initial predictor retrieves the low-frequency information, and then a denoiser network estimates the residual high-frequency details by iterative refinement. The high-frequency information is restored by estimating the residual image, which is finally added back to the image predicted by the initial predictor network. As the backbone of the initial predictor we employ an efficient nonlinear activation-free network (NAFNet) [NafNet]. As the backbone of the denoiser diffusion model we design a novel and effective variation of NAFNet that is conditioned on the diffusion timestep $t$ in order to improve performance in denoising and deblurring tasks. It is worth noting that the initial predictor recovers the overall structure of the document, while the denoiser restores the details that make the text correct and readable. To better preserve text characters, we finetune our framework with an additional differentiable module based on convolutional recurrent neural networks, simulating the behavior of an OCR system during training.
  • Figure 4: Proposed internal structure of each NAF block of our denoiser diffusion model. We added a time processing branch, depicted in red. This branch processes the time embedding and transforms it into scale and shift parameters $\gamma$ and $\beta$, which control the scale and bias terms of each normalization layer. SCA: Simple Channel Attention; MLP: Multi-Layer Perceptron.
  • Figure 5: Our proposed OCR-based finetuning enhances character readability and reduces the character error rate of our network's output during the sampling phase. We attach a CRNN module just after the output layer of our network. We perform word-level text extraction and feed the CRNN module with all the extracted word-level image patches. The CRNN emulates the behaviour of a commercial OCR and recognizes text from the input patches. The recognized text is compared to the reference text and the CTC loss is computed. Finally, the weights of our network are updated using eq. \ref{loss_finetune}.
  • ...and 4 more figures
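
The captions of Figures 3 and 4 describe a residual decomposition and a time-conditioned normalization. The block below is one possible way to write them down, offered as a reading aid rather than the paper's own equations. All symbols are assumptions introduced here: $y$ is the degraded image, $x$ the clean target, $f_{\phi}$ the initial predictor, $r_0$ the residual that the diffusion model learns to generate, $\hat{r}_0$ its sampled estimate, and $z$ the feature map entering a normalization layer. Whether the scale enters as $\gamma(t)$ or as $1 + \gamma(t)$ is not stated in the captions.

```latex
\begin{align}
  x_{\mathrm{init}} &= f_{\phi}(y)
    && \text{initial prediction (low-frequency content)} \\
  r_0 &= x - x_{\mathrm{init}}
    && \text{residual targeted by the denoiser (high-frequency details)} \\
  \hat{x} &= x_{\mathrm{init}} + \hat{r}_0
    && \text{restored image after iterative refinement} \\
  h &= \gamma(t) \odot \mathrm{Norm}(z) + \beta(t)
    && \text{time-conditioned normalization inside a NAF block}
\end{align}
```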
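
The Figure 5 caption relies on the CTC loss to compare the text recognized by the CRNN with the reference. The sketch below computes the standard CTC negative log-likelihood with the forward (alpha) recursion in plain Python, only to illustrate what that loss measures. It is not the authors' training code: the function name, the toy two-symbol alphabet, and the choice of index 0 for the blank are assumptions. Practical implementations work in log-space on batched tensors to avoid underflow.

```python
import math


def ctc_negative_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs:  T rows, each a list of per-class probabilities for one time step.
    target: label indices without blanks, e.g. [1, 2, 1].
    """
    # Extended target: a blank before, between, and after the labels.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: probability of all partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


# Toy example: 2 time steps, classes {0: blank, 1: "a"}, uniform predictions.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_negative_log_likelihood(uniform, [1]))
```

In the toy example the target "a" is reachable through three alignments ("a" then blank, blank then "a", and "a" repeated), so the likelihood is 0.75. A target that cannot fit into the available time steps, such as "aa" in two steps, has zero likelihood and an infinite loss.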