Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding

Chuanhao Sun, Zhihang Yuan, Kai Xu, Luo Mai, N. Siddharth, Shuo Chen, Mahesh K. Marina

TL;DR

This work tackles the brittleness of Fourier-feature-based positional encodings (PE) in learning high-frequency functions due to fixed frequency choices and hyperparameter sensitivity. It introduces Sinusoidal Positional Encoding (SPE), a trainable, adaptive-frequency PE defined by $\mathrm{SPE}(\mathbf{x})=\sin(\boldsymbol{\omega}\mathrm{PE}(\mathbf{x}))$, which can be plugged into existing architectures with minimal changes. Across few-view NeRF, Text-to-Speech, and NTK-based 1D regression, SPE yields higher fidelity and faster convergence without task-specific tuning, and it provides quantitative metrics (WDPR, RWDE) to assess learned frequency content. The results demonstrate SPE as a robust, generalizable tool for efficient high-frequency learning in data-limited regimes, with broad practical impact for rendering, synthesis, and regression tasks.
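The definition $\mathrm{SPE}(\mathbf{x})=\sin(\boldsymbol{\omega}\mathrm{PE}(\mathbf{x}))$ can be illustrated with a minimal NumPy sketch. Several details below are assumptions, not taken from the paper: the NeRF-style form of `PE` (including the factor of $\pi$), the reading of $\boldsymbol{\omega}$ as an elementwise weight vector (a weight matrix is another plausible reading), and the function names `positional_encoding` and `spe`. Training of $\boldsymbol{\omega}$ is omitted.

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """Assumed NeRF-style PE: sin/cos of 2^l * pi * x for l = 0..num_freqs-1."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # fixed frequencies, shape (L,)
    angles = x[..., None] * freqs                    # shape (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)            # flatten to (..., D * 2L)

def spe(x, omega, num_freqs):
    """SPE(x) = sin(omega * PE(x)); omega is assumed to act elementwise."""
    return np.sin(omega * positional_encoding(x, num_freqs))

# Hypothetical usage: a batch of 4 points in 3D, encoded with L = 10.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(4, 3))
omega = np.ones(3 * 2 * 10)      # would be a trainable parameter in practice
features = spe(points, omega, num_freqs=10)
print(features.shape)
```

Under these assumptions the output has the same width as the fixed PE, so a downstream MLP would not need to change when the encoding is swapped.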

Abstract

Fourier features based positional encoding (PE) is commonly used in machine learning tasks that involve learning high-frequency features from low-dimensional inputs, such as 3D view synthesis and time series regression with neural tangent kernels. Despite their effectiveness, existing PEs require manual, empirical adjustment of crucial hyperparameters, specifically the Fourier features, tailored to each unique task. Further, PEs face challenges in efficiently learning high-frequency functions, particularly in tasks with limited data. In this paper, we introduce sinusoidal PE (SPE), designed to efficiently learn adaptive frequency features closely aligned with the true underlying function. Our experiments demonstrate that SPE, without hyperparameter tuning, consistently achieves enhanced fidelity and faster training across various tasks, including 3D view synthesis, Text-to-Speech generation, and 1D regression. SPE is implemented as a direct replacement for existing PEs. Its plug-and-play nature lets numerous tasks easily adopt and benefit from SPE.

Paper Structure (31 sections, 3 theorems, 33 equations, 15 figures, 7 tables)

Key Result

Theorem 3.1

$L$ determines the approximation accuracy of SPE to a trainable PE (proof in §spelearn).

Figures (15)

  • Figure 1: New view generation in NeRF with 8 input views on the Blender dataset [mildenhall2021nerf]. $L_r$ is the number of PE components used for position coordinates, and $L_d$ is the number used for view directions.
  • Figure 2: The optimal PE for NeRF on the Blender dataset has only a negligible influence on speech generation with FastSpeech, while our method achieves better alignment of the red regions with the ground truth.
  • Figure 3: Objects generated by APENeRF, InstantNeRF and our method. APENeRF uses hash encoding and it is hard to train with 8 views on the Blender synthetic dataset whereas ours (i.e., SPE), even with limited data, can already achieve high-quality generation close to the ground truth.
  • Figure 4: Features learned by SPE for different objects in the Blender dataset. A learned feature rarely goes beyond $L=9$, so setting $L=10$ is the optimal configuration for PE. $\omega^*$ denotes the feature: $\omega^* = \omega\cdot 2^{l-1}$, $l\in\{1, 2, \dots, L\}$.
  • Figure 5: Example of implementing SPE in NeRF: a periodic activation function is applied to the frequency-encoded series. $(x, y, z)$ are the coordinates of the object. Dir indicates the direction of the view.
  • ...and 10 more figures
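
The feature definition in the Figure 4 caption, $\omega^* = \omega\cdot2^{l-1}$ for $l\in\{1, 2, \dots, L\}$, can be evaluated with the short sketch below. The function name `effective_features` and the example values of $\omega$ are hypothetical and chosen only for illustration; they are not learned values from the paper.

```python
import numpy as np

def effective_features(omega, num_levels):
    """Return omega * 2^(l-1) for l = 1..num_levels, following the Figure 4 caption."""
    levels = np.arange(1, num_levels + 1)             # l = 1, 2, ..., L
    omega = np.asarray(omega, dtype=np.float64)       # scalar or one value per level
    return omega * 2.0 ** (levels - 1)

# Hypothetical example: made-up omega values, one per level, with L = 4.
example_omega = np.array([1.0, 0.5, 1.5, 0.25])
print(effective_features(example_omega, num_levels=4))
```

A value of $\omega$ below 1 at a given level lowers the feature relative to the fixed factor $2^{l-1}$, and a value above 1 raises it.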

Theorems & Definitions (5)

  • Theorem 3.1
  • Theorem 3.2
  • proof
  • Remark 3.3
  • Theorem 3.4