PIP: Positional-encoding Image Prior

Nimrod Shabtay, Eli Schwartz, Raja Giryes

TL;DR

This work reframes Deep Image Prior (DIP) as an implicit neural representation (INR) and introduces Positional Encoding Image Prior (PIP), which substitutes DIP’s random latent input with Fourier-feature encodings and replaces convolutional layers with per-coordinate MLPs. With this reparameterization, PIP achieves denoising and super-resolution performance similar to DIP's with far fewer parameters, and it extends naturally to video via 3D Fourier features, delivering better temporal consistency than 3D-DIP and other INR approaches. The paper investigates spectral bias, compares architectural variants (CNN vs. MLP; fixed vs. learned frequencies), and demonstrates broad applicability, including inpainting, dehazing, and CLIP inversion. The results suggest that Fourier-feature-based positional encoding is a powerful, flexible prior for image and video restoration, with implications for implicit representations and NeRF-like multitask generalization.

Abstract

In Deep Image Prior (DIP), a Convolutional Neural Network (CNN) is fitted to map a latent space to a degraded (e.g. noisy) image, but in the process it learns to reconstruct the clean image. This phenomenon is attributed to the CNN's internal image prior. We revisit the DIP framework, examining it from the perspective of a neural implicit representation. Motivated by this perspective, we replace the random or learned latent with Fourier features (positional encoding). We show that, thanks to the properties of Fourier features, we can replace the convolution layers with simple pixel-level MLPs. We name this scheme "Positional Encoding Image Prior" (PIP) and show that it performs very similarly to DIP on various image-reconstruction tasks while requiring far fewer parameters. Additionally, we demonstrate that PIP can be easily extended to videos, where 3D-DIP struggles and suffers from instability. Code and additional examples for all tasks, including videos, are available on the project page https://nimrodshabtay.github.io/PIP/
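To make the idea of Fourier-feature inputs and pixel-level MLPs concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fourier_features`, `pixel_mlp`, `init_weights`), the geometric frequency spacing, the layer sizes, and the values of `num_freqs` and `max_freq` are all hypothetical choices. The same encoder handles a 2D pixel grid or a 3D (time, height, width) grid, which is one plausible reading of how the scheme extends to video.

```python
import numpy as np

def fourier_features(shape, num_freqs=8, max_freq=8.0):
    """Encode every grid coordinate with sines and cosines at fixed frequencies.

    shape is (H, W) for an image or (T, H, W) for a video.
    num_freqs and max_freq are illustrative values, not the paper's settings.
    """
    # Normalized coordinates in [0, 1) along each axis.
    axes = [np.arange(n) / n for n in shape]
    coords = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (*shape, D)

    # Assumed frequency bank: geometrically spaced from 1 up to max_freq.
    freqs = np.geomspace(1.0, max_freq, num_freqs)                 # (F,)
    angles = 2.0 * np.pi * coords[..., None] * freqs               # (*shape, D, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*shape, -1)                               # (*shape, 2*D*F)

def init_weights(sizes, seed=0):
    """Random (untrained) weights for an MLP with the given layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def pixel_mlp(features, weights):
    """Apply the same MLP independently to each pixel's feature vector."""
    x = features
    for w, b in weights[:-1]:
        x = np.maximum(x @ w + b, 0.0)           # ReLU hidden layers
    w, b = weights[-1]
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))    # sigmoid keeps RGB in (0, 1)

image_feats = fourier_features((32, 48))          # 2D grid for an image
video_feats = fourier_features((4, 16, 16))       # 3D grid for a short video
weights = init_weights([image_feats.shape[-1], 64, 64, 3])
image = pixel_mlp(image_feats, weights)
print(image_feats.shape, video_feats.shape, image.shape)
```

In a restoration setting the weights would be fitted so that the output matches the degraded image; the sketch only shows the forward mapping. Because the MLP has no spatial mixing, any coupling between neighboring pixels has to come from the smoothness of the encoding itself.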

Paper Structure (9 sections, 2 theorems, 13 equations, 20 figures, 8 tables)

This paper contains 9 sections, 2 theorems, 13 equations, 20 figures, 8 tables.

Key Result

Proposition 1

When $E$ is the $\ell_2$ loss and $f(z) = h * z$, where $h$ is a convolution kernel, the reparametrized optimization problem (labelled eq:reparametrization in the paper) is equivalent to an element-wise optimization with Fourier features.
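The following numerical check is not the paper's proof. It only illustrates a standard fact that makes the statement plausible: the Fourier basis diagonalizes convolution, so an $\ell_2$ loss on a convolution output splits into independent per-frequency terms (Parseval's identity). The sketch assumes a 1D signal and circular convolution, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
z = rng.normal(size=n)   # latent input
y = rng.normal(size=n)   # degraded target signal
h = rng.normal(size=n)   # convolution kernel (full length, circular)

# Circular convolution h * z computed via the convolution theorem.
H, Z, Y = np.fft.fft(h), np.fft.fft(z), np.fft.fft(y)
conv = np.real(np.fft.ifft(H * Z))

# Direct circular convolution, to confirm the FFT route is the same operator.
direct = np.array([sum(h[k] * z[(i - k) % n] for k in range(n))
                   for i in range(n)])

# l2 loss in the spatial domain.
spatial_loss = np.sum((conv - y) ** 2)

# The same loss as a sum of element-wise (per-frequency) terms.
# NumPy's FFT is unnormalized, hence the 1/n factor from Parseval.
per_frequency = np.abs(H * Z - Y) ** 2 / n
frequency_loss = np.sum(per_frequency)

print(np.allclose(conv, direct), np.isclose(spatial_loss, frequency_loss))
```

Each entry of `per_frequency` depends on a single Fourier coefficient of the kernel, so minimizing the loss over `h` decouples into one small problem per frequency.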

Figures (20)

  • Figure 1: We offer a novel view of DIP as an implicit model that maps noise to RGB values (left). Although it maps noise to a degraded image, DIP produces a clean image. We suggest that this image prior, or regularization, stems from the convolutional structure of the DIP architecture: neighboring pixels in the output image (blue and orange in the picture) are functions of almost the same noise box in the input, only slightly shifted. From this implicit-model perspective, we suggest that a similar 'image prior' effect can be achieved by replacing the input noise with Fourier features. We also prove equivalence in the linear network case. As a result, we may use a simple pixel-level MLP that has far fewer parameters than the DIP CNN and still produces denoised images of the same quality. Remarkably, for video, this leads to a significant improvement.
  • Figure 2: Image denoising examples for Gaussian noise ($\sigma=25$). From left to right: clean image (GT), noisy image, DIP (CNN) and PIP (MLP) results. The results for DIP and PIP are very similar, suggesting they follow a similar image prior.
  • Figure 3: SR examples. The first example is for $\times 4$ SR; the second is for $\times 8$ SR. The results for DIP (CNN) and PIP (MLP) are very similar, suggesting they have a similar image prior.
  • Figure 4: Effect of the frequency hyperparameter in SIREN vs. PIP. We apply different bandwidth limits to SIREN (different $\omega_0$ values) and PIP (different $F_{max}$ values) and evaluate the reconstruction results on the "standard dataset". PIP produces over-smoothed images when the maximum frequency is low but converges to a steady plateau as the frequency range grows. SIREN, by contrast, has an optimal point around $\omega_0=2^5=32$, and its performance drops as the frequency range moves away from that point.
  • Figure 5: PIP architecture. Our architecture is based on the common DIP architecture, a U-Net with skip connections.
  • ...and 15 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof