High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net

Zinuo Li, Xuhang Chen, Chi-Man Pun, Xiaodong Cun

TL;DR

This work tackles high‑resolution document shadow removal by introducing SD7K, a large real‑world dataset with over 7k shadow/shadow‑free pairs under diverse lighting, and FSENet, a frequency‑aware network that decouples processing across frequency bands via a Laplacian Pyramid. The low‑frequency deshading path uses Dimension‑Aware Transformer blocks and a Tri‑layer Attention Alignment module to correct illumination, while a high‑frequency restoration path learns contours to recover fine details, guided by a loss combining smoothly weighted L1 and SSIM terms. The combination of a large, varied dataset and a frequency‑aware architecture yields state‑of‑the‑art results on SD7K and existing benchmarks, with ablations validating the contributions of LP depth, DAT/DFE/TAA, and the high‑frequency contour module. This work has practical impact for improving readability and downstream document understanding tasks, particularly in real‑world capture scenarios where shadows are unavoidable, albeit at the cost of higher computation and non‑real‑time performance on edge devices.

Abstract

Shadows often appear when documents are captured with casual equipment, degrading the visual quality and readability of the digital copies. Unlike natural shadow removal, document shadow removal must preserve the details of fonts and figures in high-resolution input. Previous works ignore this problem and remove shadows via approximate attention and small datasets, which might not work in real-world situations. We handle high-resolution document shadow removal directly via a larger-scale real-world dataset and a carefully designed frequency-aware network. For the dataset, we acquire over 7k pairs of high-resolution (2462 × 3699) real-world document images covering varied samples under different lighting conditions, which is 10 times larger than existing datasets. For the network, we decouple the high-resolution images in the frequency domain, so that low-frequency details and high-frequency boundaries can be learned effectively by the carefully designed network structure. Powered by our network and dataset, the proposed method clearly outperforms previous methods in visual quality and numerical results. The code, models, and dataset are available at: https://github.com/CXH-Research/DocShadow-SD7K

Paper Structure (17 sections, 3 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Visual results of input document shadow image (a), classic method (b), supervised method (c), weakly-supervised method (d), unsupervised method (e) and ours (f). Our model removes the shadow while preserving the original document's content and aspect ratio.
  • Figure 2: (a) Our data acquisition setup for constructing the dataset. (b) A remote phone shutter control, with which the user only needs to click a button to capture a pair of shadow-free/shadow images. (c) Examples of our occluders.
  • Figure 3: Data distribution of SD7K and quantitative comparison across all document shadow datasets.
  • Figure 4: Example shadow and shadow-free images from SD7K.
  • Figure 5: The network structure of our proposed FSENet. Following [liang2021high], given a high-resolution image $I\in \mathbb{R}^{H\times W\times 3}$, we first use a Laplacian Pyramid (e.g., depth $D = 2$ in this case) to decompose the image into multiple frequency components. Black arrows: the low-frequency part $L_3\in \mathbb{R}^{\frac{H}{2^D}\times \frac{W}{2^D}\times C}$ is refined to $L_3'\in \mathbb{R}^{\frac{H}{2^D}\times \frac{W}{2^D}\times C}$ using the DAT, TAA and DFE blocks. Cyan arrows: for the high-frequency part $L_2\in \mathbb{R}^{\frac{H}{2^{D-1}}\times \frac{W}{2^{D-1}}\times C}$, a contour $C_{L_2}\in \mathbb{R}^{\frac{H}{2^{D-1}}\times \frac{W}{2^{D-1}}\times 1}$ is learned to bridge the low-frequency and high-frequency features. Red arrows: for the remaining higher-frequency components, the learned contour is successively upsampled and refined using the proposed SPP and TRM.
  • ...and 6 more figures
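
The Laplacian Pyramid decomposition described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 2×2 average-pooling `downsample`, the nearest-neighbour `upsample`, and all function names are assumptions chosen for brevity (a Gaussian kernel is the more common choice). With depth $D = 2$, the sketch yields two high-frequency residuals at full and half resolution plus one low-frequency base at $\frac{H}{2^D}\times\frac{W}{2^D}$, matching the spatial sizes given for $L_2$ and $L_3$ in the caption.

```python
import numpy as np


def downsample(img):
    """Halve height and width by 2x2 average pooling.

    Assumption: a simple stand-in for Gaussian blur + decimation.
    Height and width must be even.
    """
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def upsample(img):
    """Double height and width by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def build_laplacian_pyramid(img, depth):
    """Return `depth` high-frequency residuals followed by one low-frequency base.

    Height and width must be divisible by 2**depth.
    """
    levels = []
    current = img
    for _ in range(depth):
        smaller = downsample(current)
        # The residual keeps what the coarser level cannot represent (edges, strokes).
        levels.append(current - upsample(smaller))
        current = smaller
    levels.append(current)  # low-frequency base (illumination, smooth shading)
    return levels


def reconstruct(levels):
    """Invert the pyramid by upsampling the base and adding residuals back."""
    current = levels[-1]
    for residual in reversed(levels[:-1]):
        current = upsample(current) + residual
    return current


# Hypothetical toy input; real SD7K images are far larger.
rng = np.random.default_rng(0)
image = rng.random((64, 48, 3))

pyramid = build_laplacian_pyramid(image, depth=2)
shapes = [level.shape for level in pyramid]
restored = reconstruct(pyramid)
```

Because each residual is defined as the difference between a level and its upsampled coarser version, the decomposition is invertible up to floating-point error regardless of the filters used. This is what lets a network edit the small low-frequency base for illumination while handling the fine detail through the residuals.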