Cascaded Robust Rectification for Arbitrary Document Images
Chaoyun Wang, Quanxin Huang, I-Chao Shen, Takeo Igarashi, Nanning Zheng, Caigui Jiang
TL;DR
This paper introduces a cascaded, three-stage rectification framework (L-Net for perspective, C-Net for geometry, F-Net for content) that corrects arbitrary document distortions in a coarse-to-fine manner. It combines canonical view normalization, adaptive iterative refinement, and a principled loss design to achieve state-of-the-art results across multiple benchmarks. To address weaknesses in existing evaluation protocols, it proposes layout-aligned OCR metrics (AED/ACER), which decouple rectification quality from the layout analysis errors of OCR engines, and masked geometric metrics (AD-M/AAD-M), which exclude background regions so that documents with incomplete boundaries are scored fairly. The paper reports robust and efficient performance, includes detailed ablations and comparisons with commercial tools, and outlines future work on multi-view reconstruction to overcome single-view limitations.
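As background for the OCR metrics, the sketch below computes the standard edit distance (ED) and character error rate (CER) between a reference string and an OCR output. It assumes that AED/ACER are variants of these two quantities computed after a layout alignment step; that alignment step is not described in this section and is not shown. The function names are illustrative and are not taken from the paper's code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, character_error_rate("document", "docunent") counts one substitution over eight reference characters.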
Abstract
Document rectification in real-world scenarios poses significant challenges due to extreme variations in camera perspectives and physical distortions. Driven by the insight that complex transformations can be decomposed and resolved step by step, we introduce a novel multi-stage framework that progressively reverses distinct distortion types in a coarse-to-fine manner. Specifically, our framework first performs a global affine transformation to correct perspective distortions arising from the camera's viewpoint, then rectifies geometric deformations resulting from physical paper curling and folding, and finally employs a content-aware iterative process to eliminate fine-grained content distortions. To address limitations in existing evaluation protocols, we also propose two enhanced metrics: layout-aligned OCR metrics (AED/ACER), which provide a stable assessment by decoupling geometric rectification quality from the layout analysis errors of OCR engines, and masked AD/AAD (AD-M/AAD-M), which are tailored to accurately evaluating geometric distortions in documents with incomplete boundaries. Extensive experiments show that our method establishes new state-of-the-art performance on multiple challenging benchmarks, reducing the AAD metric by 14.1% to 34.7% and demonstrating superior efficacy in real-world applications. The code will be publicly available at https://github.com/chaoyunwang/ArbDR.
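The coarse-to-fine decomposition can be illustrated with a toy sketch in which every stage is a backward map from output pixel coordinates to input coordinates. This is not the paper's networks: the three stage functions below are hand-written stand-ins, and all names (global_transform_map, displacement_map, compose_cascade) are hypothetical. The only point shown is that if stage k produces I_k(p) = I_{k-1}(B_k(p)), then the final image samples the original at B_1(B_2(B_3(p))).

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
BackwardMap = Callable[[Point], Point]


def global_transform_map(m: Sequence[Sequence[float]]) -> BackwardMap:
    """Backward map given by a global 3x3 matrix in homogeneous
    coordinates (affine when the last row is 0, 0, 1)."""
    def apply(p: Point) -> Point:
        x, y = p
        w = m[2][0] * x + m[2][1] * y + m[2][2]
        return ((m[0][0] * x + m[0][1] * y + m[0][2]) / w,
                (m[1][0] * x + m[1][1] * y + m[1][2]) / w)
    return apply


def displacement_map(dx: Callable[[float, float], float],
                     dy: Callable[[float, float], float]) -> BackwardMap:
    """Backward map given by a per-point displacement field."""
    def apply(p: Point) -> Point:
        x, y = p
        return (x + dx(x, y), y + dy(x, y))
    return apply


def compose_cascade(stages: Sequence[BackwardMap]) -> BackwardMap:
    """Compose stage maps listed in cascade order (coarse first).
    The last stage is applied first, giving B_1(B_2(...B_n(p)))."""
    def composed(p: Point) -> Point:
        for stage in reversed(stages):
            p = stage(p)
        return p
    return composed


# Toy stand-ins for the three stages (assumed values, for illustration only).
coarse = global_transform_map([[1.0, 0.0, 2.0],
                               [0.0, 1.0, 0.0],
                               [0.0, 0.0, 1.0]])                  # global shift
medium = displacement_map(lambda x, y: 0.0, lambda x, y: 0.5 * x)  # smooth bend
fine = displacement_map(lambda x, y: 0.25, lambda x, y: 0.0)       # small residual

source_of = compose_cascade([coarse, medium, fine])
```

Here source_of(p) gives the location in the original photograph that ends up at pixel p of the final output, so a composed map of this kind would allow the input to be resampled once rather than after every stage. Whether the paper's pipeline resamples once or per stage is not stated in this section.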
