PocketSR: The Super-Resolution Expert in Your Pocket Mobiles

Haoze Sun, Linfeng Jiang, Fan Li, Renjing Pei, Zhixin Wang, Yong Guo, Jiaqi Xu, Haoyu Chen, Jin Han, Fenglong Song, Yujiu Yang, Wenbo Li

TL;DR

PocketSR tackles the high cost of diffusion-based RealSR by substituting the heavy VAE with a compact Lite Encoder-Decoder and by progressively transferring generative priors to lightweight U-Net modules through online annealing pruning and multi-layer feature distillation. The two-stage training yields a 146M-parameter, one-step SR model that processes 4K images in about 0.8 seconds and delivers perceptual results on par with state-of-the-art methods, while achieving real-time performance on a single GPU. This work demonstrates that diffusion priors can be leveraged for practical, edge-friendly SR without sacrificing fidelity, enabling deployment in mobile photography and related applications. Limitations include reduced detail generation under severe degradations and a need for further hardware-aware optimization across edge platforms.

Abstract

Real-world image super-resolution (RealSR) aims to enhance the visual quality of in-the-wild images, such as those captured by mobile phones. While existing methods leveraging large generative models demonstrate impressive results, the high computational cost and latency make them impractical for edge deployment. In this paper, we introduce PocketSR, an ultra-lightweight, single-step model that brings generative modeling capabilities to RealSR while maintaining high fidelity. To achieve this, we design LiteED, a highly efficient alternative to the original computationally intensive VAE in SD, reducing parameters by 97.5% while preserving high-quality encoding and decoding. Additionally, we propose online annealing pruning for the U-Net, which progressively shifts generative priors from heavy modules to lightweight counterparts, ensuring effective knowledge transfer and further optimizing efficiency. To mitigate the loss of prior knowledge during pruning, we incorporate a multi-layer feature distillation loss. Through an in-depth analysis of each design component, we provide valuable insights for future research. PocketSR, with a model size of 146M parameters, processes 4K images in just 0.8 seconds, achieving a remarkable speedup over previous methods. Notably, it delivers performance on par with state-of-the-art single-step and even multi-step RealSR models, making it a highly practical solution for edge-device applications.
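
The two training ideas named in the abstract, progressively shifting priors from a heavy module to a lightweight one and distilling features at several layers, can be illustrated with a minimal sketch. This is not the authors' implementation. The cosine schedule, the linear blending of branch outputs, the per-layer MSE, and all function names (`anneal_weight`, `blended_output`, `feature_distillation_loss`) are assumptions chosen for illustration, and features are plain Python lists rather than tensors.

```python
import math


def anneal_weight(step, total_steps):
    """Assumed cosine schedule: 1.0 at step 0 (heavy branch only),
    decaying to 0.0 at total_steps (light branch only)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * t))


def blended_output(heavy_out, light_out, w):
    """Element-wise mix of the two branches; w is the weight on the heavy branch."""
    return [w * h + (1.0 - w) * l for h, l in zip(heavy_out, light_out)]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_distillation_loss(student_feats, teacher_feats, layer_weights=None):
    """Weighted sum of per-layer MSE between student and teacher features.
    Each argument is a list with one feature vector per layer."""
    if layer_weights is None:
        layer_weights = [1.0] * len(student_feats)
    return sum(
        w * mse(s, t)
        for w, s, t in zip(layer_weights, student_feats, teacher_feats)
    )


# Usage: halfway through the schedule the two branches contribute equally.
w_mid = anneal_weight(50, 100)
mixed = blended_output([2.0, 4.0], [0.0, 0.0], w_mid)
loss = feature_distillation_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

Under this sketch, once the weight reaches zero the heavy branch no longer contributes and could be removed, leaving only the lightweight module. The distillation term would be added to the main training loss so that intermediate features of the pruned network stay close to those of the unpruned one.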

Paper Structure

This paper contains 25 sections, 4 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Visualization of the real-world image super-resolution results and efficiency of our proposed method. To enable the practical application of diffusion-based SR models, we introduce PocketSR, an ultra-lightweight, single-step solution. The top visual examples demonstrate that PocketSR achieves high-quality super-resolution across diverse scenes, preserving fine details and textures. Best viewed zoomed in. The bottom section highlights the significant reduction in parameters ($10 \times$) and computational cost ($6.5 \times$), allowing our model to process 4K images in just 0.8 seconds—dramatically outpacing existing methods.
  • Figure 2: Overview of the PocketSR framework. We replace the original Stable Diffusion variational autoencoder with LiteED, and apply online annealing pruning and multi-layer feature distillation strategies to the diffusion U-Net, effectively reducing model parameters and computational complexity while maintaining excellent super-resolution performance.
  • Figure 3: Analysis of the impact of pruning residual blocks at different depths using the widely adopted RealSR [cai2019toward] test set, measured by the LPIPS [lpips] metric. The inference resolution is $512 \times 512$, and we report the maximum inference speed on a single GPU with about 19.5 TFLOPS for FP32.
  • Figure 4: Qualitative results on real-world images. Single-step methods are highlighted for clarity. PocketSR delivers competitive performance, generating well-preserved structures and fine-grained textures, even when compared to multi-step models.
  • Figure 5: Ablation study of the LiteED design on "Canon 003" from RealSR (top) and "DSC 1286" from DRealSR (bottom).
  • ...and 6 more figures