LIPT: Latency-aware Image Processing Transformer
Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin
TL;DR
This work tackles the gap between FLOPs-based efficiency and practical latency in image processing transformers. It introduces LIPT, a latency-aware transformer that replaces a portion of memory-intensive self-attention with lightweight convolutions in a two-level block design. LIPT adds NVSM-SA to capture long-range context at no extra cost and HRM to boost high-frequency detail. Multi-branch structures are used during training and are reparameterized into a single Rep-Conv at inference, so LIPT achieves real-time GPU inference while delivering competitive PSNR/SSIM across image super-resolution, JPEG artifact reduction, and denoising tasks. NVSM-SA and HRM are shown to improve both quality and latency, and extensive ablations confirm the importance of the long-range and high-frequency components. Overall, LIPT demonstrates a practical path to deploying high-performance transformers in real-time low-level vision applications, reducing memory access overhead and achieving a state-of-the-art latency-PSNR trade-off on several benchmarks.
Abstract
Transformers are leading a trend in the field of image processing. Although existing lightweight image processing transformers have achieved great success, they are tailored to reducing FLOPs or parameter counts rather than to practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block, which substitutes memory-intensive operators with a combination of self-attention and convolutions to achieve practical speedup. Specifically, we propose a novel non-volatile sparse masking self-attention (NVSM-SA) that uses a pre-computed sparse mask to capture contextual information from a larger window with no extra computational overhead. In addition, a high-frequency reparameterization module (HRM) is proposed to make the LIPT block reparameterization-friendly, which improves the model's detail reconstruction capability. Extensive experiments on multiple image processing tasks (e.g., image super-resolution (SR), JPEG artifact reduction, and image denoising) demonstrate the superiority of LIPT in both latency and PSNR. LIPT achieves real-time GPU inference with state-of-the-art performance on multiple image SR benchmarks.
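
To make the idea of reparameterization concrete, the sketch below shows the general principle on a single channel: a training-time block with parallel 3x3, 1x1, and identity branches can be folded into one 3x3 convolution that gives the same output at inference. This is only an illustration of structural reparameterization in general. The branch layout, the kernel values, and the helper names (`conv2d_same`, `merge_branches`, `multi_branch`) are assumptions made for this example and are not the HRM design, which the abstract does not specify.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (output size = input size).

    Assumes a square kernel with odd side length.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - r, x + dx - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch, summed."""
    a = conv2d_same(image, k3)
    b = conv2d_same(image, k1)
    h, w = len(image), len(image[0])
    return [[a[y][x] + b[y][x] + image[y][x] for x in range(w)] for y in range(h)]


def merge_branches(k3, k1):
    """Inference-time form: fold all three branches into a single 3x3 kernel.

    The 1x1 weight and the identity (weight 1.0) both act on the centre pixel,
    so they are added to the centre tap of the 3x3 kernel.
    """
    merged = [row[:] for row in k3]  # copy so the training kernel is untouched
    merged[1][1] += k1[0][0] + 1.0
    return merged


# Hypothetical example values: a Laplacian-like 3x3 kernel and a scalar 1x1 kernel.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]]
k1 = [[0.5]]

train_out = multi_branch(image, k3, k1)
infer_out = conv2d_same(image, merge_branches(k3, k1))

# Largest absolute difference between the two forms; it should be numerically zero.
max_diff = max(
    abs(train_out[y][x] - infer_out[y][x]) for y in range(3) for x in range(3)
)
```

Because convolution is linear, the merged kernel reproduces the multi-branch output while needing only one pass over the feature map, which is why this kind of folding reduces memory access at inference.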
