RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution

Youngwan Jin, Incheol Park, Yagiz Nalcakan, Hyeongjin Ju, Sanghyeop Yeo, Shiho Kim

TL;DR

This paper proposes the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), an architecture that explicitly encodes scene layout information into the attention mechanism so that learned regional priors can dynamically modulate the local reconstruction process.

Abstract

General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios such as surveillance and autonomous driving, where cameras operate from fixed or nearly static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By feeding both kinds of tokens into the attention computation, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
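To make the dual-token idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each attention window summarizes its content into a local token by averaging, and that this local token and a regional prior token are simply appended to the window's keys and values. The function names (`prior_guided_attention`, `mean_token`), the averaging step, the absence of learned projections, and all sizes are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_token(tokens):
    """Summarize a window into one local token by averaging (an assumption)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def prior_guided_attention(window_tokens, prior_token):
    """Attend from each pixel token to [pixel tokens + local token + prior token].

    Query/key/value projections and multiple heads are omitted for brevity,
    so keys double as values here.
    """
    local_token = mean_token(window_tokens)
    keys = window_tokens + [local_token, prior_token]
    scale = 1.0 / math.sqrt(len(prior_token))
    outputs = []
    for query in window_tokens:
        weights = softmax([dot(query, k) * scale for k in keys])
        mixed = [
            sum(w * k[i] for w, k in zip(weights, keys))
            for i in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


random.seed(0)
dim, window_size = 4, 9  # hypothetical: a 3x3 window of 4-dimensional features
window = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(window_size)]
# In training this vector would be a learnable parameter tied to one spatial
# region; here it is just a fixed placeholder.
regional_prior = [0.5] * dim
out = prior_guided_attention(window, regional_prior)
print(len(out), len(out[0]))  # 9 4
```

Because the prior token is a parameter rather than a function of the input, it would be shared across all frames that cover the same region, which is what lets it behave as a persistent memory under this reading of the abstract.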

Paper Structure (26 sections, 9 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison of attention mechanisms. (a) Standard self-attention computes global relationships at a high computational cost. (b) Window self-attention limits computation to local windows but misses global context. (c) Our proposed Regional Prior Attention (RPA) fuses a persistent, learnable Regional Prior (R.P.) token with a Local token. The R.P. token learns the scene's static layout over epochs, providing a strong structural guide for the reconstruction.
  • Figure 2: The overall architecture of our proposed RPT-SR. (Top) The model consists of a shallow feature stem, a deep body of RPA Blocks, and a reconstruction head. (Bottom Left) A detailed view of the Regional Prior Attention (RPA) module. A dynamic Local Token, summarized from the input, is fused with a learnable, static Regional Prior Token. These are processed by a Hierarchical Attention (HAT) block to guide the reconstruction. (Bottom Right) The hierarchical windowing strategy, where the attention window size increases in deeper layers.
  • Figure 3: Qualitative comparison for $\times$4 super-resolution on the M3FD dataset. Our method (RPT-SR) reconstructs sharper details and more plausible textures compared to existing state-of-the-art methods, particularly in restoring fine structures like human figures, building facades, and distant objects.
  • Figure 4: Attention maps at the last RPA layer for samples from the M3FD test set. From left to right: the local-only baseline, the static-prior variant, and our full model.
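The cost trade-off described in the Figure 1 caption can be checked with simple arithmetic. The sketch below counts query-key score computations for global self-attention versus non-overlapping window attention, with an optional number of extra keys per window to stand in for appended prior and local tokens. The feature-map size, the window sizes, and the `extra_keys` mechanism are illustrative assumptions, not values taken from the paper.

```python
def global_attention_scores(height, width):
    """Every token attends to every token: N^2 query-key scores."""
    n = height * width
    return n * n


def window_attention_scores(height, width, window, extra_keys=0):
    """Tokens attend only within non-overlapping window x window tiles.

    extra_keys models additional tokens (for example one prior token and one
    local token) appended to each window's key set; this is an assumed design.
    """
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * (tokens_per_window + extra_keys)


h = w = 64  # hypothetical feature-map size
print(global_attention_scores(h, w))  # 16777216
for win in (4, 8, 16):  # hypothetical window sizes that grow with depth
    print(win, window_attention_scores(h, w, win, extra_keys=2))
```

Under these assumptions, adding two extra keys per window increases the cost only linearly in the number of tokens, while enlarging the window in deeper layers moves the cost gradually toward the global case.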