Intra and Inter Parser-Prompted Transformers for Effective Image Restoration
Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang
TL;DR
The paper tackles image restoration under unknown degradation by leveraging parser features from a large visual foundation model. It introduces PPTformer, a two-branch framework in which IRNet performs restoration and PPFGNet generates parser-guided features. The branches are integrated through IN2PPT blocks comprising IntraPPA, InterPPA, and a Parser-Prompted Feed-forward Network, together with a Bidirectional Parser-Prompted Fusion (BiPPF). The core idea is to use parser content both implicitly and explicitly, during long-range attention and pixel-wise modulation, to guide restoration. Experiments on four tasks (deraining, defocus deblurring, desnowing, and low-light enhancement) show state-of-the-art performance, supporting the value of fusing SAM-derived hierarchical structures into restoration. One limitation is that offline parser generation can be memory-intensive. The approach offers a general path for infusing foundation-model-derived structural cues into low-level vision tasks, with potential for broader applicability and for further integration of parser features during training.
Abstract
We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. IntraPPA reconsiders cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. In contrast, InterPPA first fuses restoration features with parser features and then feeds the fused features into an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network that guides restoration through pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.
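The three mechanisms named in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation, and everything beyond the abstract's wording is an assumption made for illustration: the function names, the absence of learned projections, the choice of restoration tokens as queries and parser tokens as keys/values in the cross attention, element-wise addition as the fusion step, and a sigmoid as the pixel-wise gate. Features are represented as lists of tokens, each token a list of channel values.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over token lists shaped [N][C]."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def intra_ppa(restore, parser):
    """Assumed cross attention: restoration tokens query the parser tokens."""
    return attention(restore, parser, parser)

def inter_ppa(restore, parser):
    """Assumed fuse-then-attend: add the features, then self-attend."""
    fused = [[r + p for r, p in zip(rt, pt)] for rt, pt in zip(restore, parser)]
    return attention(fused, fused, fused)

def gated_ffn(restore, parser):
    """Assumed pixel-wise gate: sigmoid(parser) scales each restoration value."""
    return [[r / (1.0 + math.exp(-p)) for r, p in zip(rt, pt)]
            for rt, pt in zip(restore, parser)]

# Hypothetical toy inputs: 3 tokens with 2 channels each.
restore = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
parser = [[0.5, 0.5], [2.0, -2.0], [0.0, 0.0]]

print(intra_ppa(restore, parser))
print(inter_ppa(restore, parser))
print(gated_ffn(restore, parser))
```

In this sketch the implicit path lets parser content enter only through the attention weights and values, the explicit path mixes parser content into the features before any attention is computed, and the gate modulates each restoration value independently. A real model would add learned projections, multiple heads, and normalization around each step.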
