Sharing Key Semantics in Transformer Makes Efficient Image Restoration
Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe
TL;DR
This work tackles the inefficiency of global self-attention in Vision Transformers for image restoration by introducing SemanIR, which builds a per-stage Key-Semantic Dictionary that links each degraded patch to its top-$k$ semantically related patches via KNN. The dictionary is shared across all transformer layers within a stage, restricting attention to semantically relevant patches and yielding linear complexity within each window. Across six IR tasks, SemanIR achieves state-of-the-art performance and favorable efficiency, supported by extensive ablations on the top-$k$ parameter and attention implementations (Triton, Torch-Gather, Torch-Mask). The approach demonstrates the value of stage-wide semantic sharing for restoring degraded images and offers practical insights into efficient ViT-based IR deployments, with code and models available at the authors' GitHub page.
Abstract
Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. However, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. Processing this irrelevant information introduces computational inefficiency, which is particularly noticeable at high input resolutions. Additionally, for IR, it is commonly noted that small segments of a degraded image, especially those closely aligned semantically, provide particularly relevant information for restoration, as they contribute contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR performance by sharing the key semantics in a Transformer for IR (i.e., SemanIR). Specifically, SemanIR first constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. This dictionary is then shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on the semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the state-of-the-art performance of SemanIR, both quantitatively and qualitatively. The visual results, code, and trained models are available at https://github.com/Amazingren/SemanIR.
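The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`build_key_semantic_dict`, `key_semantic_attention`, `run_stage`) are hypothetical, cosine similarity on the stage's input features is an assumed choice of KNN metric, the projection weights are random, and windowing, multi-head attention, and feed-forward layers are omitted. The sketch only shows the two ideas named above: a top-$k$ index table built once per stage, and attention in every block of that stage restricted to the indexed patches.

```python
import numpy as np


def build_key_semantic_dict(x, k):
    """For each of the N patches, return the indices of its k most similar patches.

    Assumption: similarity is cosine similarity between patch features.
    """
    x_norm = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = x_norm @ x_norm.T                    # (N, N) pairwise similarity
    # argsort is ascending, so the last k columns hold the top-k neighbours.
    return np.argsort(sim, axis=1)[:, -k:]     # (N, k) integer indices


def key_semantic_attention(x, wq, wk, wv, topk_idx):
    """Attention in which each query only sees the k patches listed in topk_idx."""
    q, key, v = x @ wq, x @ wk, x @ wv         # each (N, d)
    k_sel = key[topk_idx]                      # gather keys:   (N, k, d)
    v_sel = v[topk_idx]                        # gather values: (N, k, d)
    scores = np.einsum("nd,nkd->nk", q, k_sel) / np.sqrt(q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", weights, v_sel)


def run_stage(x, layers, k):
    """Build the index table once, then reuse it in every block of the stage."""
    topk_idx = build_key_semantic_dict(x, k)
    for wq, wk, wv in layers:
        x = x + key_semantic_attention(x, wq, wk, wv, topk_idx)  # residual update
    return x, topk_idx


# Toy example: 16 patches, 8 channels, 4 neighbours per patch, 2 blocks.
rng = np.random.default_rng(0)
N, d, k = 16, 8, 4
x = rng.standard_normal((N, d))
layers = [tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
          for _ in range(2)]
out, topk_idx = run_stage(x, layers, k)
```

In this gather-style variant each query scores only $k$ keys, so the attention cost is $O(Nkd)$ rather than $O(N^2 d)$, which is linear in the number of patches $N$ for a fixed $k$. The $O(N^2)$ similarity computation is paid once per stage rather than once per block.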
