LIPT: Latency-aware Image Processing Transformer

Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin

TL;DR

This work tackles the gap between FLOPs-based efficiency and practical latency in image processing transformers. It introduces LIPT, a latency-aware transformer that replaces a portion of the memory-intensive self-attention with lightweight convolutions in a two-level block design. The block is augmented with NVSM-SA, which captures long-range context without extra computation, and HRM, which boosts high-frequency detail. HRM is trained as a multi-branch structure and reparameterized into a simpler Rep-Conv at inference time. LIPT achieves real-time GPU inference while delivering competitive PSNR/SSIM on image super-resolution, JPEG artifact reduction, and denoising. Ablations show that NVSM-SA and HRM both contribute to quality and latency, confirming the importance of the long-range and high-frequency components. Overall, LIPT offers a practical path to deploying high-performance transformers in real-time low-level vision applications: it reduces memory access overhead and reaches a state-of-the-art latency-PSNR trade-off on several benchmarks.

Abstract

Transformers are leading a trend in the field of image processing. Despite the great success that existing lightweight image processing transformers have achieved, they are tailored to reducing FLOPs or parameters rather than to practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block, which substitutes memory-intensive operators with a combination of self-attention and convolutions to achieve practical speedup. Specifically, we propose a novel non-volatile sparse masking self-attention (NVSM-SA) that utilizes a pre-computed sparse mask to capture contextual information from a larger window with no extra computational overhead. In addition, a high-frequency reparameterization module (HRM) is proposed to make the LIPT block reparameterization-friendly, which improves the model's detail reconstruction capability. Extensive experiments on multiple image processing tasks (e.g., image super-resolution (SR), JPEG artifact reduction, and image denoising) demonstrate the superiority of LIPT in both latency and PSNR. LIPT achieves real-time GPU inference with state-of-the-art performance on multiple image SR benchmarks.
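The reparameterization idea mentioned in the abstract can be illustrated with a minimal, generic sketch. It is not the HRM itself: the branch set (a 3x3 convolution, a 1x1 convolution and an identity skip), the single-channel setting and the helper names `conv2d_same` and `merge_branches` are all assumptions made for illustration, and normalization folding is omitted. The sketch only shows why parallel linear branches used during training can be folded into one convolution for inference.

```python
import random

def conv2d_same(img, kernel, bias=0.0):
    """Naive zero-padded 'same' cross-correlation on a single-channel 2D list."""
    k = len(kernel)
    pad = k // 2
    h, w = len(img), len(img[0])
    out = [[bias] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[i][j] * img[yy][xx]
    return out

def merge_branches(k3, b3, k1, b1):
    """Fold a 3x3 branch, a 1x1 branch and an identity skip into one 3x3 kernel.

    Hypothetical branch set chosen for illustration only.
    """
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0  # the 1x1 weight and the identity both act on the centre tap
    return merged, b3 + b1

random.seed(0)
H, W = 5, 6
img = [[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)]
k3 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
b3, k1, b1 = 0.3, -0.7, 0.1

# Training-time form: three parallel branches whose outputs are summed.
out3 = conv2d_same(img, k3, b3)
out1 = conv2d_same(img, [[k1]], b1)
multi = [[out3[y][x] + out1[y][x] + img[y][x] for x in range(W)] for y in range(H)]

# Inference-time form: a single 3x3 convolution with the merged kernel.
km, bm = merge_branches(k3, b3, k1, b1)
single = conv2d_same(img, km, bm)

max_err = max(abs(multi[y][x] - single[y][x]) for y in range(H) for x in range(W))
print(f"max difference between the two forms: {max_err:.2e}")
```

Because every branch is linear in the input, the merged kernel reproduces the multi-branch output up to floating-point rounding, so the extra branches cost nothing at inference time.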

Paper Structure

This paper contains 25 sections, 13 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) Performance on Urban100 for $\times 2$ SR. Larger circles indicate higher computational cost in FLOPs, which, however, is not directly proportional to practical running latency. (b) Inference time of MSA, MLP and Conv in low-level Transformers. The total inference times on GPU (CPU) of SwinIR-Light, N-Gram and LIPT-Small are 756 ms (6.3 s), 479 ms (9.2 s) and 99 ms (2.8 s), respectively.
  • Figure 2: Illustration of the proposed LIPT (taking the SR task and expansion window size $s=2$ for example). (a) LIPT block with low-latency proportion MSA-Conv. (b) NVSM-SA is proposed to capture contextual information from the larger window with no extra computation cost, where SLWA and DLWA denote sparse large and dense local window attentions, respectively. (c) HRM is developed to improve the model's detail reconstruction capability.
  • Figure 3: (a) Illustration of different non-volatile/volatile sampling masks (NVM/VM). $\beta$ denotes the non-volatility drop rate. (b) Sampling process of three masks, where the original window size $p$ and expansion window size $s$ are set to 4 and 2, respectively.
  • Figure 4: Maximum memory allocation during inference on DIV2K validation set. Statistics are collected following the implementation of zhang2019aim.
  • Figure 5: Qualitative comparison on the "img098" image of Urban100 for $\times 4$ SR.
  • ...and 8 more figures
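
The captions for Figures 2 and 3 describe a sparse mask that samples from an expanded window, with original window size $p=4$ and expansion window size $s=2$. The sketch below is one possible reading, under stated assumptions: it assumes the expanded window spans $(p \cdot s) \times (p \cdot s)$ positions and that the mask keeps $p^2$ of them, so the number of attended tokens matches a dense $p \times p$ window. The strided sampling pattern and the names `strided_sparse_mask` and `gather` are hypothetical; the paper's actual masks, including the role of the drop rate $\beta$, are not modelled here.

```python
def strided_sparse_mask(p, s):
    """Return a (p*s) x (p*s) 0/1 mask that keeps one position per s x s cell.

    Assumed sampling pattern, used for illustration only.
    """
    size = p * s
    return [[1 if (y % s == 0 and x % s == 0) else 0 for x in range(size)]
            for y in range(size)]

def gather(window, mask):
    """Collect the masked entries of a 2D window in row-major order."""
    return [window[y][x]
            for y in range(len(mask))
            for x in range(len(mask[0]))
            if mask[y][x]]

P, S = 4, 2                       # values quoted in the Figure 3 caption
MASK = strided_sparse_mask(P, S)  # computed once, then reused for every window

# Token ids 0..63 laid out on the expanded 8 x 8 window.
size = P * S
window = [[y * size + x for x in range(size)] for y in range(size)]
tokens = gather(window, MASK)

print(len(tokens), "tokens sampled from", size * size, "positions")
```

Since the mask is fixed ahead of time, building it adds no per-forward-pass work, and attention over the gathered tokens would see the same sequence length as a dense local window while covering a larger spatial extent.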