GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement
Zhihong Tang
TL;DR
The paper tackles enhancement of color document images that suffer from multiple degradations in real-world capture scenarios. It proposes GL-PGENet, a global-to-local framework with two components: a Global Perception Parameter Network (GPPNet) first regresses lightweight global parameters to produce a globally corrected image $I_g$, and a Dual-Branch Local-Refine Network (DB-LRNet), which fuses a smoothing branch with a dense-block NestUNet, then refines details and generates the final image $I_e$ from learned parameters rather than by direct pixel prediction. Combined with a two-stage training strategy (pretraining on a synthetic dataset of 500k+ samples, followed by task-specific fine-tuning), this parametric-generation approach achieves state-of-the-art SSIM scores on DocUNet ($0.7721$) and RealDAE ($0.9480$), reduces inference time on high-resolution images by about 75%, and generalizes well across domains. By balancing enhancement quality, computational efficiency, and robustness to domain shift, the work shows practical viability for real-world document digitization and downstream tasks such as OCR.
Abstract
Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods, which are confined to single-degradation restoration or grayscale image processing, we present the Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images that ensures both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations. First, a hierarchical enhancement framework integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with a parametric generation mechanism replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense blocks effectively fuses low-level pixel features with high-level semantic features and is specifically adapted to the characteristics of document images. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, which achieves state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
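To make the idea of parametric generation more concrete, here is a minimal sketch of a global-then-local pipeline in plain Python. It is not the authors' implementation: the abstract does not state which parametric form the networks predict, so the affine (gain, bias) form is an assumption chosen for simplicity, and the function names `apply_global` and `apply_local` are hypothetical. The parameter values are hard-coded stand-ins for what the global and local networks would output.

```python
# Illustrative sketch only. The affine (gain, bias) parameterization is an
# ASSUMPTION; the abstract does not specify the form used by GL-PGENet.

def clamp01(x):
    """Keep an intensity inside the valid [0, 1] range."""
    return max(0.0, min(1.0, x))

def apply_global(image, gain, bias):
    """Hypothetical global stage: one affine correction per color channel.

    image: H x W nested list of (r, g, b) floats in [0, 1].
    gain, bias: 3-tuples standing in for a global regressor's output.
    """
    return [
        [tuple(clamp01(g * c + b) for c, g, b in zip(px, gain, bias))
         for px in row]
        for row in image
    ]

def apply_local(image, gain_map, bias_map):
    """Hypothetical local stage: per-pixel gain and bias maps.

    The maps stand in for a local-refine network's output. The enhanced
    image is generated by applying them to the input, rather than by
    predicting output pixels directly.
    """
    return [
        [tuple(clamp01(gain_map[y][x] * c + bias_map[y][x]) for c in px)
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]

# A 1 x 2 toy "document": a dim paper pixel and a washed-out ink pixel.
I_in = [[(0.5, 0.5, 0.5), (0.25, 0.25, 0.25)]]

# Global correction: brighten the whole page with a single parameter set.
I_g = apply_global(I_in, gain=(1.5, 1.5, 1.5), bias=(0.125, 0.125, 0.125))

# Local refinement: push paper to white and darken ink, pixel by pixel.
I_e = apply_local(I_g, gain_map=[[1.0, 0.5]], bias_map=[[0.125, 0.0]])

print(I_g)  # [[(0.875, 0.875, 0.875), (0.5, 0.5, 0.5)]]
print(I_e)  # [[(1.0, 1.0, 1.0), (0.25, 0.25, 0.25)]]
```

One point the sketch illustrates: because the output is a parameterized transform of the input, a smooth parameter map yields a smooth change in the output, which is consistent with the abstract's claim that parametric generation improves local consistency.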
