Implicit Neural Representation for Video and Image Super-Resolution

Mary Aiyetigbo, Wanqi Yuan, Feng Luo, Nianyi Li

TL;DR

SR-INR introduces a unified implicit neural representation for both image and video super-resolution, reconstructing $I_{hr}$ from $I_{lr}$ via a high-resolution grid $\boldsymbol{\mathcal{G}}_{hr}$ and multi-resolution hash-encoded texture features. By combining texture encoding, implicit hashing over a 6D latent space, and a top-down attention mechanism, the method captures spatial and temporal details without explicit motion estimation. A Pixel-Error Amplified Loss (PEA-loss) further refines fine details while mitigating over-smoothing. Across image and video benchmarks, SR-INR delivers competitive or superior results with a simpler, more efficient architecture, and ablations reveal favorable trade-offs guiding architectural choices. This unified INR-based approach highlights the potential of grid-based, hash-encoded representations for scalable, temporally stable SR in real-world applications.
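
To make the idea of multi-resolution hash-encoded features concrete, here is a minimal NumPy sketch of a generic Instant-NGP-style encoder. It is not the authors' implementation: the function names `hash_coords` and `encode`, the parameters `base_res` and `growth`, and the restriction to 3D coordinates $(x, y, t)$ are all assumptions made for illustration. The 6D latent hashing, the attention stage, and the PEA-loss are not specified in this summary and are not reproduced here.

```python
import numpy as np

# Primes used for XOR-multiply spatial hashing in Instant-NGP-style encoders.
# Three entries means this sketch supports at most 3 input dimensions.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_coords(ijk, table_size):
    """Hash integer grid coordinates (N, D) to table indices in [0, table_size)."""
    ijk = ijk.astype(np.uint64)
    h = np.zeros(ijk.shape[0], dtype=np.uint64)
    for d in range(ijk.shape[1]):
        h ^= ijk[:, d] * PRIMES[d]
    return (h % np.uint64(table_size)).astype(np.int64)


def encode(x, tables, base_res=4, growth=2.0):
    """Encode points x (N, D) in [0, 1] with one hash table per resolution level.

    tables: list of (T, F) arrays (hypothetical learnable features).
    Returns an (N, L * F) array: per-level features, concatenated.
    """
    n, d = x.shape
    assert d <= len(PRIMES), "this sketch supports up to 3 dimensions"
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)   # grid resolution at this level
        pos = x * res
        lo = np.floor(pos).astype(np.int64)     # lower corner of enclosing cell
        frac = pos - lo                         # position inside the cell
        out = np.zeros((n, table.shape[1]))
        for corner in range(2 ** d):            # d-linear interpolation
            offset = np.array([(corner >> k) & 1 for k in range(d)])
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
            idx = hash_coords(lo + offset, table.shape[0])
            out += w[:, None] * table[idx]
        feats.append(out)
    return np.concatenate(feats, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tables = [rng.normal(size=(2 ** 10, 2)) for _ in range(4)]
    xyt = rng.random((5, 3))                    # five (x, y, t) query points
    print(encode(xyt, tables).shape)
```

In a full model the tables would be trained jointly with the downstream MLP; here they are random arrays used only to show the lookup and interpolation.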

Abstract

We present a novel approach for super-resolution that utilizes implicit neural representation (INR) to effectively reconstruct and enhance low-resolution videos and images. By leveraging the capacity of neural networks to implicitly encode spatial and temporal features, our method facilitates high-resolution reconstruction using only low-resolution inputs and a 3D high-resolution grid. This results in an efficient solution for both image and video super-resolution. Our proposed method, SR-INR, maintains consistent details across frames and images, achieving impressive temporal stability without relying on the computationally intensive optical flow or motion estimation typically used in other video super-resolution techniques. The simplicity of our approach contrasts with the complexity of many existing methods, making it both effective and efficient. Experimental evaluations show that SR-INR delivers results on par with or superior to state-of-the-art super-resolution methods, while maintaining a more straightforward structure and reduced computational demands. These findings highlight the potential of implicit neural representations as a powerful tool for reconstructing high-quality, temporally consistent video and image signals from low-resolution data.
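
As a rough illustration of what a 3D high-resolution grid of query coordinates could look like, the sketch below builds normalized $(x, y, t)$ coordinates for every high-resolution pixel of a clip. The function name `hr_grid`, the pixel-center normalization to $(0, 1)$, and the coordinate ordering are assumptions for illustration, not details confirmed by the paper. A single image is treated as a clip with one frame.

```python
import numpy as np


def hr_grid(h_lr, w_lr, n_frames, scale):
    """Return (n_frames * H_hr * W_hr, 3) normalized (x, y, t) query coordinates.

    h_lr, w_lr: low-resolution height and width; scale: integer upsampling factor.
    Coordinates are pixel centers mapped into (0, 1) (an assumed convention).
    """
    h_hr, w_hr = h_lr * scale, w_lr * scale
    ys = (np.arange(h_hr) + 0.5) / h_hr
    xs = (np.arange(w_hr) + 0.5) / w_hr
    ts = (np.arange(n_frames) + 0.5) / n_frames
    # indexing="ij" keeps the (frame, row, column) order of a video tensor.
    t, y, x = np.meshgrid(ts, ys, xs, indexing="ij")
    return np.stack([x, y, t], axis=-1).reshape(-1, 3)


if __name__ == "__main__":
    coords = hr_grid(h_lr=16, w_lr=16, n_frames=1, scale=32)
    print(coords.shape)  # one row per high-resolution pixel
```

Because the coordinates are continuous, the same function can produce a denser or sparser grid simply by changing `scale`, which is one way a coordinate-based model can be queried at resolutions other than the one used in training.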

Paper Structure

This paper contains 18 sections, 17 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Visual comparison of image details in SOTA super-resolution. The leftmost column shows the ground truth, followed from left to right by the results of WGSR korkmaz2024training, EDSR edsr, LIIF chen2021liif, and our method, with an upsampling scale of 32.
  • Figure 2: SR-INR pipeline for super-resolution. Local patches are extracted at multiple resolutions and processed by MLPs to generate feature vectors. These vectors are concatenated, refined via a top-down attention mechanism, and fed into an MLP to predict the RGB value, resulting in the super-resolved output.
  • Figure 3: Generalization capability of our method. Trained with an upsampling scale of $\times$32, our method is subsequently applied to super-resolution tasks at scales of $\times$2, $\times$4, $\times$8, $\times$16, $\times$64, and $\times$128.
  • Figure 4: Visual comparison of SOTA in Image SR. From top to bottom, the datasets displayed are CelebA-HQ, DIV2K, Set5, and Set14. The leftmost column shows the input images, while the rightmost column displays the ground truth. The upsampling scale is set to $\times$32.
  • Figure 5: Visual comparison of VRT and our method on video SR. From top to bottom, the datasets shown are Vid4 and GOPRO.
  • ...and 1 more figure