
Positional Embedding-Aware Activations

Kathan Shah, Chawin Sitawarin

TL;DR

SPDER addresses the spectral bias of implicit neural representations (INRs) by introducing a semiperiodic activation $\sin(x) \cdot \delta(x)$, where $\delta(x)$ is a sublinear damping function. This lets a simple 5-layer MLP learn positional embeddings automatically while preserving coordinate magnitudes, which substantially improves how well it fits high-frequency content in images, audio, and video. Across DIV2K, audio benchmarks, and video datasets, SPDER achieves state-of-the-art or near-state-of-the-art results, training faster and reaching far lower losses than prior INR methods such as SIREN, all without hyperparameter tuning. The approach also supports downstream tasks such as high-quality super-resolution, gradient representation, and frame interpolation, which points to practical uses in faithful frequency-domain representation and, potentially, in reducing model complexity for media compression and reconstruction.

Abstract

We present a neural network architecture designed to naturally learn a positional embedding and to overcome the spectral bias toward lower frequencies exhibited by conventional activation functions. Our proposed architecture, SPDER, is a simple MLP whose activation function is a sinusoid multiplied by a sublinear function, called the damping function. The sinusoid enables the network to automatically learn the positional embedding of an input coordinate, while the damping passes on the actual coordinate value by preventing it from being projected into a finite range of values. Our results indicate that SPDERs speed up training by 10x and converge to losses 1,500-50,000x lower than those of the state of the art for image representation. SPDER is also state-of-the-art in audio representation. This superior representation capability also allows SPDER to excel on multiple downstream tasks such as image super-resolution and video frame interpolation. We provide intuition as to why SPDER fits significantly better than other implicit neural representation (INR) methods while requiring no hyperparameter tuning or preprocessing.
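The architecture described in the abstract can be sketched as a plain NumPy forward pass. This is a minimal, hypothetical illustration, not the authors' code: it assumes $\delta(x) = \sqrt{|x|}$ as the damping function, a hidden width of 256, a uniform weight initialization with zero biases chosen only for illustration, a linear output layer, and pixel coordinates normalized to $[-1, 1]$. The names `spder`, `init_mlp`, and `forward` are illustrative.

```python
import numpy as np


def spder(x: np.ndarray) -> np.ndarray:
    """Semiperiodic activation sin(x) * delta(x), assuming delta(x) = sqrt(|x|)."""
    return np.sin(x) * np.sqrt(np.abs(x))


def init_mlp(layer_sizes, rng):
    """Illustrative uniform init with zero biases (an assumption, not the paper's scheme)."""
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        bound = np.sqrt(6.0 / fan_in)
        weights = rng.uniform(-bound, bound, size=(fan_in, fan_out))
        bias = np.zeros(fan_out)
        params.append((weights, bias))
    return params


def forward(params, coords):
    """Apply the SPDER activation after every hidden layer; the last layer stays linear."""
    h = coords
    for weights, bias in params[:-1]:
        h = spder(h @ weights + bias)
    weights, bias = params[-1]
    return h @ weights + bias


rng = np.random.default_rng(0)
# Five weight layers: 2-D coordinate -> four hidden layers -> RGB value.
layer_sizes = [2, 256, 256, 256, 256, 3]
params = init_mlp(layer_sizes, rng)

# One coordinate per pixel of a 256x256 image, normalized to [-1, 1]^2.
axis = np.linspace(-1.0, 1.0, 256)
xx, yy = np.meshgrid(axis, axis, indexing="ij")
coords = np.stack([xx.ravel(), yy.ravel()], axis=-1)

rgb = forward(params, coords)
print(rgb.shape)  # (65536, 3)
```

Fitting an image would then mean minimizing, for example, the mean squared error between `rgb` and the target pixels with a gradient-based optimizer; that training loop is omitted here.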

Paper Structure (45 sections, 20 equations, 20 figures, 4 tables)

This paper contains 45 sections, 20 equations, 20 figures, and 4 tables.

Figures (20)

  • Figure 1: SPDERs learn neural representations of image magnitudes better than other MLP techniques. The 256$\times$256 ground-truth image was fit with SPDER (left) and with SIREN (center), our baseline, for the same number of steps. Simply applying a damping factor to the sine nonlinearity lowers the loss by nearly 12,000$\times$ and speeds up training by 10$\times$.
  • Figure 2: Six possible SPDER nonlinearities. For $\delta(x) = \arctan(x)$ (top center), the curve converges to $\pm\frac{\pi}{2} \sin(x)$ for $x$ sufficiently far from 0. For $\delta(x) = \sqrt{|x|}$ (top right), the magnitude of the curve grows relatively slowly, and its envelope resembles two parabolas oriented sideways. For $\delta(x) = x$ (bottom center) and $\delta(x) = x^2$ (bottom right), note how much the y-axis is stretched and how the amplitude modulation drowns out the periodicity. For these reasons, a sublinear $\delta$ is desirable.
  • Figure 3: For $\delta(x) = 1$ (left), the activation values peak distinctly at $\pm1$, the local extrema of $\sin(x)$. For $\delta(x) = \sqrt{|x|}$ (right), they peak at the local extrema of $\sin(x) \cdot \sqrt{|x|}$: $\pm1.31$, $\pm2.18$, $\pm2.81$, $\pm3.31$, etc. Using $\sin(x)$ alone as a nonlinearity discards information about the magnitude of the input, which a SPDER network retains.
  • Figure 4: (Left) SPDER's reconstruction of the image from skimage.data.cat() after training for 500 steps (about 55 seconds); note that it shows no aliasing. (Right) Single-channel log-loss curves for various activation functions on the same image. By the end of training, SPDER's loss is more than 100,000$\times$ lower than SIREN's. No configuration of positional encoding, network size, hierarchy, etc. from the literature has been shown to match a simple SPDER on this task.
  • Figure 5: 5-layer networks were trained on the image from skimage.data.text() (1st from left), and the gradients of their reconstructions are shown. SPDER already represents the edges well after only 10 training steps (2nd), especially compared with SIREN and ReLU with positional encoding (4th, 6th). After 25 steps, SPDER has captured the exact edges of the image (3rd).
  • ...and 15 more figures
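The Figure 2 argument for a sublinear damping can be made concrete by measuring the peak magnitude of $\sin(x) \cdot \delta(x)$ over a fixed interval. The sketch below compares the undamped sine ($\delta(x) = 1$, as in Figure 3) with the four damping functions named in the Figure 2 caption; the interval $[-20, 20]$ and the grid density are arbitrary illustrative choices.

```python
import numpy as np

# Dense grid on an arbitrary illustrative interval.
x = np.linspace(-20.0, 20.0, 200_001)

# delta(x) = 1 is the undamped sine; the others are damping choices from Figure 2.
dampings = {
    "1": np.ones_like(x),
    "arctan(x)": np.arctan(x),
    "sqrt(|x|)": np.sqrt(np.abs(x)),
    "x": x,
    "x^2": x**2,
}

peaks = {}
for name, delta in dampings.items():
    peaks[name] = float(np.max(np.abs(np.sin(x) * delta)))
    print(f"delta(x) = {name:<10} peak |sin(x) * delta(x)| = {peaks[name]:.2f}")
```

The bounded and sublinear choices stay within a few units on this interval, while $\delta(x) = x$ and $\delta(x) = x^2$ grow with $|x|$ and $x^2$, which is the stretching the caption describes.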
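The extrema quoted in the Figure 3 caption can be checked numerically. The sketch below, assuming only NumPy, locates the local extrema of $\sin(x) \cdot \sqrt{|x|}$ for $x \ge 0$ (the function is odd, so the negative extrema mirror these). It also illustrates the information-loss point: $\sin$ maps $x$ and $x + 2\pi$ to the same value, whereas the damped activation does not. The grid range and resolution are illustrative choices.

```python
import numpy as np


def spder(x):
    """sin(x) * sqrt(|x|); odd, so extrema for x < 0 mirror those for x > 0."""
    return np.sin(x) * np.sqrt(np.abs(x))


# Locate interior local extrema on [0, 12] via sign changes of the discrete slope.
x = np.linspace(0.0, 12.0, 1_200_001)
y = spder(x)
slope = np.diff(y)
turning = np.where(np.sign(slope[:-1]) != np.sign(slope[1:]))[0] + 1
extrema = np.abs(y[turning])
print(np.round(extrema, 2))

# Plain sine cannot tell x from x + 2*pi; the damped activation can.
a, b = 1.0, 1.0 + 2.0 * np.pi
print(np.isclose(np.sin(a), np.sin(b)))  # True
print(np.isclose(spder(a), spder(b)))  # False
```

Setting the derivative $\cos(x)\sqrt{x} + \frac{\sin(x)}{2\sqrt{x}}$ to zero gives $\tan(x) = -2x$ for $x > 0$; the four extrema found on this interval come out close to the $1.31$, $2.18$, $2.81$, $3.31$ listed in the caption.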
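Figure 5 visualizes gradients of the reconstructions. One way to obtain such gradients is to differentiate the network output with respect to the input coordinates, and every such derivative passes through the derivative of the activation. As a hedged sketch assuming $\delta(x) = \sqrt{|x|}$, the closed form is
$\frac{d}{dx}\left[\sin(x)\sqrt{|x|}\right] = \cos(x)\sqrt{|x|} + \frac{\sin(x)\,\operatorname{sign}(x)}{2\sqrt{|x|}}$,
whose second term tends to 0 as $x \to 0$. The code below checks this formula against central finite differences; `spder_grad` is an illustrative name.

```python
import numpy as np


def spder(x):
    return np.sin(x) * np.sqrt(np.abs(x))


def spder_grad(x):
    """Closed-form derivative of sin(x) * sqrt(|x|), with the x = 0 limit set to 0."""
    root = np.sqrt(np.abs(x))
    safe_root = np.where(root > 0, root, 1.0)  # avoid dividing by zero at x = 0
    damping_term = np.where(root > 0, np.sin(x) * np.sign(x) / (2.0 * safe_root), 0.0)
    return np.cos(x) * root + damping_term


# Compare against central finite differences on a grid that avoids x = 0.
x = np.linspace(-10.0, 10.0, 1000)
h = 1e-5
numeric = (spder(x + h) - spder(x - h)) / (2.0 * h)
analytic = spder_grad(x)
print(np.max(np.abs(numeric - analytic)))
```

An autodiff framework would compute this chain automatically; the explicit guard at $x = 0$ is there because the factor $\frac{1}{2\sqrt{|x|}}$ is undefined at zero even though the full product has a finite limit.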

Theorems & Definitions (2)

  • proof
  • proof