GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

Zhihong Tang

TL;DR

The paper tackles robust enhancement of color document images that suffer multiple degradations in real-world capture scenarios. It proposes GL-PGENet, a two-stage, global-to-local framework: a Global Perception Parameter Network (GPPNet) first regresses lightweight global parameters to produce a globally enhanced image $I_g$, and a Dual-Branch Local-Refine Network (DB-LRNet) then refines details by fusing a smoothing branch with a dense-block NestUNet that predicts learned parameters, yielding the final output $I_e$. This parametric-generation approach, combined with a two-stage training strategy (large-scale pretraining on 500k+ synthetic samples followed by task-specific fine-tuning), achieves state-of-the-art SSIM scores on DocUNet ($0.7721$) and RealDAE ($0.9480$), while delivering about a 75% inference-time reduction for high-resolution images and strong cross-domain generalization. The work demonstrates practical viability for real-world document digitization and downstream tasks such as OCR by balancing enhancement quality with computational efficiency and robustness to domain shifts.
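To make the global stage concrete, here is a minimal sketch of applying three scalar parameters (brightness, contrast, saturation, the quantities named in the Figure 1 caption) to an RGB image. The function name `apply_global_params` and the specific formulas are assumptions chosen from common image-processing conventions; this summary does not state the transformations GPPNet actually uses, and in the paper the scalars would be predicted by the network rather than passed in by hand.

```python
import numpy as np

def apply_global_params(img, brightness, contrast, saturation):
    """Apply three scalar adjustments to an RGB image with values in [0, 1].

    img: float array of shape (H, W, 3).
    The formulas are illustrative assumptions, not the paper's definitions.
    """
    img = np.asarray(img, dtype=np.float64)
    # Brightness: scale every pixel value by a single factor.
    out = img * brightness
    # Contrast: scale each value's deviation from the global mean.
    mean = out.mean()
    out = (out - mean) * contrast + mean
    # Saturation: interpolate between the grayscale image and the color image.
    gray = out @ np.array([0.299, 0.587, 0.114])
    out = gray[..., None] + (out - gray[..., None]) * saturation
    return np.clip(out, 0.0, 1.0)

# Example: three scalars stand in for the regressed parameters.
rng = np.random.default_rng(0)
degraded = rng.random((8, 8, 3))
i_g = apply_global_params(degraded, brightness=1.1, contrast=1.2, saturation=0.9)
```

Because only a handful of scalars are needed, they could in principle be estimated from a downsampled copy of the image and then applied at full resolution, which is one plausible reason a parametric global stage stays lightweight.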

Abstract

Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations. First, a hierarchical enhancement framework integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Third, a modified NestUNet architecture incorporating dense blocks effectively fuses low-level pixel features and high-level semantic features and is specifically adapted to document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
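The contrast between parametric generation and direct pixel prediction can be illustrated with a short sketch. It assumes the local stage outputs a per-pixel gain map and offset map that are applied as a linear transform to a smoothed copy of the globally enhanced image. The names `box_smooth`, `parametric_refine`, `gain`, and `offset` are hypothetical, and the fusion rule is an assumption: the abstract only states that outputs are produced through learned intermediate parametric representations.

```python
import numpy as np

def box_smooth(img, k=3):
    """Mean-filter each channel with a k x k window (k odd, edge-padded)."""
    img = np.asarray(img, dtype=np.float64)
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Sum the k*k shifted views of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def parametric_refine(i_g, gain, offset):
    """Assumed fusion: I_e = gain * smooth(I_g) + offset, clipped to [0, 1].

    gain and offset stand in for parameter maps a network would predict;
    they must broadcast against an (H, W, 3) image.
    """
    return np.clip(gain * box_smooth(i_g) + offset, 0.0, 1.0)

# Example with hand-set parameter maps in place of network predictions.
i_g = np.full((6, 6, 3), 0.5)
gain = np.full((6, 6, 1), 1.5)
offset = np.full((6, 6, 1), -0.1)
i_e = parametric_refine(i_g, gain, offset)
```

In a scheme like this the network never emits pixel values directly. It emits coefficients that modulate an existing image, so neighboring pixels with similar coefficients receive similar corrections.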

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overview of the proposed GL-PGENet framework. The architecture follows a coarse-to-fine two-stage paradigm. (a) The GPPNet first estimates global enhancement parameters for brightness, contrast, and saturation transformations to generate globally enhanced images with illumination consistency. (b) The DB-LRNet refines detail features: one branch employs convolutional operations for image smoothing, while the other utilizes a dense block-integrated NestUNet to learn linear transformation parameters. The final enhanced image is synthesized through the fusion of dual-branch outputs, achieving a balance between high-frequency detail preservation and local contextual consistency adaptation in color document image enhancement.
  • Figure 2: Two-Stage Image Enhancement Process Visualization. (a) Original degraded images; (b) Ground-truth reference images; (c) Global enhancement results $I_g$ from GPPNet with optimized brightness, contrast, and saturation parameters; (d) Final refined outputs $I_e$ generated by DB-LRNet demonstrating preserved high-frequency details alongside illumination consistency.
  • Figure 3: Qualitative Comparison with State-of-the-Art DIE Methods. (a) Original degraded images; (b) Ground-truth reference images; (c) DocProj [li2019document]; (d) DocRes [Zhang_2024_CVPR]; (e) DocTr [feng2021doctr]; (f) GCDRNet [zhang2023appearance]; (g) Proposed GL-PGENet. The visual comparison shows the superior performance of our method in both structural preservation (particularly document detail enhancement and semantic legibility, as shown in Row 5) and color processing (effective restoration and balanced color reproduction, observed in Rows 3-4). Comparative results indicate that GL-PGENet achieves comprehensive improvements over existing benchmark methods across multiple perceptual criteria.
  • Figure 4: Visual Comparison: Baseline vs. Our Efficient Method. The baseline inference approach (top row) is compared with our efficient implementation (bottom row), showing negligible quality degradation despite $3\times$ acceleration. Quantitative analysis reveals only a 1.06% relative SSIM reduction (0.9480 vs. 0.9379), and the efficient variant still outperforms prior methods.
  • Figure 5: Frequency Analysis of Natural Images and Document Images. Document images show pronounced high-frequency components along the horizontal and vertical directions, whereas the energy of natural images is concentrated mainly in the low-frequency range.
  • ...and 2 more figures
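
The kind of frequency analysis described in the Figure 5 caption can be approximated with a 2D FFT. The sketch below measures what fraction of an image's non-DC spectral energy lies in a narrow band around the horizontal and vertical frequency axes. The function name `axis_energy_ratio` and the `band` parameter are hypothetical, and this is not necessarily the analysis the authors ran.

```python
import numpy as np

def axis_energy_ratio(gray, band=1):
    """Fraction of non-DC spectral energy near the two frequency axes.

    gray: 2D float array. band: half-width (in frequency bins) of the
    strips around the zero-frequency row and column.
    """
    gray = np.asarray(gray, dtype=np.float64)
    # Subtract the mean so the DC term does not dominate the total.
    spec = np.fft.fftshift(np.fft.fft2(gray - gray.mean()))
    power = np.abs(spec) ** 2
    h, w = power.shape
    cy, cx = h // 2, w // 2  # location of zero frequency after fftshift
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - band:cy + band + 1, :] = True
    mask[:, cx - band:cx + band + 1] = True
    total = power.sum()
    return float(power[mask].sum() / total) if total > 0 else 0.0

# Synthetic stand-ins: stripes mimic text lines, noise has no preferred axis.
rows = np.arange(64)[:, None]
stripes = np.broadcast_to(((rows // 4) % 2).astype(float), (64, 64))
noise = np.random.default_rng(0).random((64, 64))
stripe_ratio = axis_energy_ratio(stripes)
noise_ratio = axis_energy_ratio(noise)
```

On the synthetic stripe pattern nearly all energy falls on one frequency axis, while for unstructured noise the ratio stays close to the fraction of bins the mask covers. Real document and natural images would fall somewhere between these extremes.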