FIPER: Factorized Features for Robust Image Super-Resolution and Compression

Yang-Che Sun, Cheng Yu Yeo, Ernie Chu, Jun-Cheng Chen, Yu-Lun Liu

TL;DR

This work tackles the unification of low-level vision tasks by introducing Factorized Features, a basis–coefficient representation that explicitly encodes multi-scale frequencies. The approach comprises a Coefficient Backbone and a Basis Swin Transformer to generate spatially varying coefficients and content-aware bases, enabling accurate reconstruction via a coordinated sampling and reconstruction pipeline. By applying this representation to both Single-Image Super-Resolution and Learned Image Compression, and extending it to Multi-Image Compression with a Basis Merging Transformer, the method achieves state-of-the-art rate–distortion performance and substantial PSNR gains, while preserving structural detail such as repetitive patterns. The framework demonstrates the value of a frequency-aware, shared-basis representation for robust image reconstruction and compression, with potential extension to broader low-level vision tasks and video. Key ideas include $f(\mathbf{x}) = \sum_{i=1}^N c_i(\mathbf{x})\, b_i(\mathbf{x})$, learned bases $b_i$, coordinate transforms $\gamma$, and multi-frequency modulation $\{\alpha_j, \psi(\cdot)\}$ that jointly capture high- and low-frequency content.
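
As a minimal sketch of the basis-coefficient product $f(\mathbf{x}) = \sum_{i=1}^N c_i(\mathbf{x})\, b_i(\mathbf{x})$ named above, the NumPy snippet below evaluates the sum on a small grid. The function name `factorized_field`, the toy bases, and the toy coefficients are illustrative assumptions; in the paper both quantities are predicted by networks, which this sketch does not model.

```python
import numpy as np

def factorized_field(coeffs, bases):
    """Evaluate f(x) = sum_i c_i(x) * b_i(x) on a shared sampling grid.

    coeffs, bases: arrays of shape (N, H, W), one map per basis index i.
    (Hypothetical helper for illustration, not the authors' implementation.)
    """
    coeffs = np.asarray(coeffs, dtype=np.float64)
    bases = np.asarray(bases, dtype=np.float64)
    if coeffs.shape != bases.shape:
        raise ValueError("coeffs and bases must share shape (N, H, W)")
    # Pointwise product per basis index, then sum over the basis axis.
    return (coeffs * bases).sum(axis=0)

# Toy setup (assumed values): a constant basis and one cosine basis.
H, W = 4, 4
ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
bases = np.stack([np.ones((H, W)), np.cos(2 * np.pi * xs)])
# Spatially varying coefficients: a constant weight and a vertical ramp.
coeffs = np.stack([np.full((H, W), 0.5), ys])

field = factorized_field(coeffs, bases)
```

Because the coefficients vary with position, the same cosine basis contributes nothing at the top of the grid and its full amplitude at the bottom, which is the sense in which the coefficients are spatially varying.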

Abstract

In this work, we propose a unified representation, termed Factorized Features, for low-level vision tasks, and evaluate it on Single Image Super-Resolution (SISR) and Image Compression. The two tasks share a core principle: both require recovering and preserving fine image details, whether by enhancing resolution in SISR or by reconstructing compressed data in Image Compression. Unlike previous methods that mainly focus on network architecture, our approach uses a basis-coefficient decomposition together with an explicit formulation of frequencies to capture structural components and multi-scale visual features in images, addressing the core challenges of both tasks. To validate the potential for broad generalizability, we replace the simple feature maps used by prior models with Factorized Features. We further optimize the compression pipeline by leveraging the mergeable-basis property of Factorized Features, which consolidates shared structures in multi-frame compression. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in Super-Resolution (SR) and a 9.35% BD-rate reduction in Image Compression compared to the previous SOTA. Project page: https://jayisaking.github.io/FIPER/

Paper Structure

This paper contains 39 sections, 15 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: In this work, we propose to represent images by Factorized Features, contrasting with prior methods that either (a) rely on fixed-basis decomposition [cosae, wavemixsr] or (b) introduce various complex model blocks [chen2023hat, PFT] such as sparse attention, dense skip connections, etc. As highlighted in the orange boxes, (a) successfully captures periodic line structures but suffers from lower fidelity, while (b) cannot recover those repeating patterns without explicit frequency modeling. In contrast, our method (c) uses learned decomposition with generalizable bases to deliver high-fidelity reconstruction while preserving periodic structures. We also present LAM [LAM] and Diffusion Index (DI) [LAM] for reconstructing the red-boxed patch, with its pixel-importance map shown in the middle row; a higher DI indicates more pixel utilization.
  • Figure 2: Comparison of different trainable basis settings. We test on single-image regression with $f(\mathbf{x}) = \sum_{i=1}^{N} c_i(\mathbf{x})\,b_i(\mathbf{x})$ under an equal parameter count. (a) Fixed sinusoidal basis, which favors low-frequency content (\ref{concise-decom}); (b) learnable (i.e., gradient-updated) coordinate mapping, with moderate gains; (c) Learned Non-uniform Basis $M_i\in\mathbb{R}^{T\times T}$, which converges more synchronously. Note that $\alpha_1,\alpha_2$ are the frequencies, functionally similar to $u, v$.
  • Figure 3: Visualization of different factorization methods. This figure shows the difference between vanilla fields, coordinate transformation, and multi-frequency modulation. Specifically, (a) is a vanilla basis-coefficient field, and (b) adds a sawtooth transformation. We can see in (b) that the coordinate transformation explicitly models a patch-like pattern, e.g., the second plot from the top is divided into two periods. By enforcing such frequency components, the models can decompose the signal effectively. Next, (d) is our full formulation, while (c) has no $\alpha$. Intuitively, going from (b) to (c), applying $\psi$ (we use $\cos$ here) should have little effect on performance, since the learnable bases are unconstrained. With the introduction of $\alpha$, the models are forced to attend to signal components of different frequencies.
  • Figure 4: The Proposed Factorized Features. In this figure, $\alpha=\{1,16\}$, $\psi=\sin(\cdot)$, and $\gamma_i(x)$ is the coordinate transformation. With the sawtooth transformation $\gamma$, the feature is prompted to learn patterns at different scales in a patch-like manner; e.g., $\gamma_2$ makes the same basis repeat four times in the grid. However, $\gamma$ alone is not enough for very fine details; therefore, we introduce $\alpha$ and $\psi$. Each basis is forced to accommodate both high-frequency (large $\alpha$) and low-frequency (small $\alpha$) components through the periodic function $\psi$; a larger $\alpha$ makes $\psi$ oscillate more rapidly. With an explicit formulation of repetitive patterns and high-frequency components, images can be represented with fine-grained details, which is especially useful for low-level vision tasks such as Super-Resolution and Image Compression.
  • Figure 5: Super-Resolution and Image Compression with Factorized Features. This figure illustrates how Factorized Features are used in Super-Resolution and Image Compression. (a) Given a low-resolution input image, we first extract the feature $X_{coeff}$ with the Coefficient Backbone. Next, from the same $X_{coeff}$, we generate the basis with the Basis Swin Transformer and the coefficient with convolution layers. Finally, the prediction is reconstructed by Factorized Features Reconstruction. (b) To decrease distortion in image compression, we replace the synthesis transform of the traditional learned image compression pipeline with (a) by aligning spatial resolution and latent channels.
  • ...and 8 more figures
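
The Figure 4 caption describes a sawtooth coordinate transform $\gamma$ that tiles a basis across the grid, followed by a periodic function $\psi$ applied at several frequencies $\alpha$. The NumPy sketch below illustrates that idea under stated assumptions: the sawtooth is taken to be $\gamma_k(x) = (k x) \bmod 1$, the basis is sampled by nearest-neighbour lookup, and the per-frequency outputs are simply stacked. None of these specifics are confirmed by the captions, and the names `sawtooth`, `sample_basis`, and `modulated_feature` are hypothetical.

```python
import numpy as np

def sawtooth(x, k):
    """Assumed sawtooth transform: maps x in [0, 1) to (k * x) mod 1."""
    return np.mod(k * x, 1.0)

def sample_basis(basis, u, v):
    """Nearest-neighbour lookup of a T x T basis at normalized coords in [0, 1)."""
    T = basis.shape[0]
    rows = np.minimum((u * T).astype(int), T - 1)
    cols = np.minimum((v * T).astype(int), T - 1)
    return basis[rows, cols]

def modulated_feature(basis, k, alphas, H, W, psi=np.sin):
    """Stack psi(alpha * b(gamma_k(x))) for each alpha on an H x W grid."""
    # Pixel-centre coordinates in [0, 1).
    ys = (np.arange(H) + 0.5) / H
    xs = (np.arange(W) + 0.5) / W
    # The sawtooth wraps coordinates so the basis repeats k times per axis.
    u, v = np.meshgrid(sawtooth(ys, k), sawtooth(xs, k), indexing="ij")
    b = sample_basis(basis, u, v)  # shape (H, W)
    # A larger alpha makes psi oscillate faster over the same basis values.
    return np.stack([psi(a * b) for a in alphas])  # shape (len(alphas), H, W)

# Toy basis (assumed values) and the caption's settings: alpha = {1, 16}, psi = sin.
T = 4
basis = np.arange(T * T, dtype=float).reshape(T, T) / (T * T)
feat = modulated_feature(basis, k=2, alphas=(1.0, 16.0), H=8, W=8)
```

With `k=2` the 4 x 4 basis appears in a 2 x 2 tiling of the 8 x 8 grid, matching the caption's remark that $\gamma_2$ repeats the same basis four times.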