CAMixerSR: Only Details Need More "Attention"

Yan Wang; Yi Liu; Shijie Zhao; Junlin Li; Li Zhang

TL;DR

This work addresses the challenge of high-quality SR on very large images by unifying two dominant strategies: content-aware routing and advanced token mixers. It introduces CAMixer, a content-aware mixer that uses a predictor to allocate computation between convolution and deformable window-attention, guided by offsets $\Delta p$, a mixer mask $m$, and spatial/channel attentions, with a global classification loss to sharpen partitioning. Stacking CAMixers yields CAMixerSR, which delivers state-of-the-art quality-efficiency trade-offs across large-image SR, lightweight SR, and omnidirectional SR, outperforming several baselines at lower computational cost. The approach is practical for high-resolution SR and could be integrated with existing acceleration frameworks to further improve efficiency.

Abstract

To satisfy the rapidly increasing demand for large-image (2K-8K) super-resolution (SR), prevailing methods follow two independent tracks: 1) accelerating existing networks by content-aware routing, and 2) designing better super-resolution networks via token mixer refining. Despite their directness, these approaches encounter unavoidable defects (e.g., inflexible routes or non-discriminative processing) that limit further improvement of the quality-complexity trade-off. To overcome these drawbacks, we integrate the two schemes by proposing a content-aware mixer (CAMixer), which assigns convolution to simple contexts and additional deformable window-attention to sparse textures. Specifically, the CAMixer uses a learnable predictor to generate multiple bootstraps, including offsets for window warping, a mask for classifying windows, and convolutional attentions that endow convolution with a dynamic property. These bootstraps modulate attention to include more useful textures self-adaptively and improve the representation capability of convolution. We further introduce a global classification loss to improve the accuracy of predictors. By simply stacking CAMixers, we obtain CAMixerSR, which achieves superior performance on large-image SR, lightweight SR, and omnidirectional-image SR.
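The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `route_windows` and `mix`, the top-k selection rule, and the per-window scalar scores are assumptions made for illustration, and the offsets and convolutional attentions described above are omitted. It only shows how a predictor score per window and an attention ratio $\gamma$ could yield a binary mixer mask $m$ that sends complex windows to an expensive branch and the rest to a cheap one.

```python
from typing import Callable, List, Sequence, TypeVar

W = TypeVar("W")  # a window of features; its concrete type is left open


def route_windows(scores: Sequence[float], gamma: float) -> List[bool]:
    """Build a binary mixer mask from per-window predictor scores.

    scores: one score per window (higher = more complex content).
    gamma:  fraction of windows sent to the attention branch, in [0, 1].
    Returns a list where True marks a window routed to attention.
    Top-k selection is an assumption of this sketch.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    k = round(gamma * len(scores))  # number of windows given attention
    # Indices of the k highest-scoring windows; ties are broken by position.
    top = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]
    mask = [False] * len(scores)
    for i in top:
        mask[i] = True
    return mask


def mix(
    windows: Sequence[W],
    scores: Sequence[float],
    gamma: float,
    attention_fn: Callable[[W], W],
    conv_fn: Callable[[W], W],
) -> List[W]:
    """Apply attention_fn to masked windows and conv_fn to the others."""
    mask = route_windows(scores, gamma)
    return [attention_fn(w) if m else conv_fn(w) for w, m in zip(windows, mask)]


# Toy usage: "windows" are plain numbers and the branches are stand-ins.
result = mix([1, 2, 3, 4], [0.1, 0.9, 0.4, 0.8], 0.5,
             attention_fn=lambda w: w * 10, conv_fn=lambda w: w)
```

With $\gamma = 0.5$, the two highest-scoring windows in the toy input take the attention stand-in and the other two take the convolution stand-in. Lowering $\gamma$ shifts more windows to the cheap branch, which is the quality-complexity knob the abstract refers to.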

Paper Structure (13 sections, 12 equations, 8 figures, 10 tables)

Introduction
Related Work
Method
Content-Aware Mixing
Network Architecture
Training Loss
Experiment
Implementation Details
Ablation Study
Large-Image SR
Lightweight SR
Omni-Directional-Image SR
Conclusion

Figures (8)

Figure 1: Comparison of the ClassSR framework and CAMixer. Left) the plain/complex patches are at varied levels of difficulty to restore. Middle) ClassSR crops input images into sub-images for discriminative processing through models of varied complexities. Right) we introduce a content-aware mixer (CAMixer) that calculates self-attention for complex regions and convolution for simple contexts.
Figure 2: Performance (PSNR-FLOPs) comparison on Test8K. The green dashed line indicates the trade-off curve of CAMixerSR.
Figure 3: Overview of the proposed CAMixer. CAMixer consists of three parts: Predictor, Self-Attention branch, and Convolution branch.
Figure 4: Visualizations of predicted mixer mask $m$ of CAMixerSR. The lighter the color, the larger the magnitude. The scores of attention windows are in black, and the ones of convolution are in white. The unmasked tokens with more complex content (higher score) are processed by self-attention.
Figure 5: Ablation study on attention ratio $\gamma$.
...and 3 more figures
