Frequency and Generalisation of Periodic Activation Functions in Reinforcement Learning

Augustine N. Mavor-Parker, Matthew J. Sargent, Caswell Barry, Lewis Griffin, Clare Lyle

TL;DR

This work empirically analyzes learned Fourier features in off-policy reinforcement learning, focusing on whether improvements arise from high-frequency expressivity or low-frequency generalization. By comparing LFF and CLFF within SAC on the DeepMind Control Suite, the authors show that periodic representations consistently converge to high frequencies largely independent of initialization, and that their generalization benefits erode under input noise due to increased brittleness and higher effective rank. Weight decay is proposed as a practical regularizer that partially offsets overfitting and maintains faster learning while improving robustness. The findings suggest a trade-off between expressiveness and generalization, motivating adaptive architectures that can modulate frequency according to state novelty or perturbations.

Abstract

Periodic activation functions, often referred to as learned Fourier features, have been widely demonstrated to improve sample efficiency and stability in a variety of deep RL algorithms. Potentially incompatible hypotheses have been made about the source of these improvements. One is that periodic activations learn low frequency representations and as a result avoid overfitting to bootstrapped targets. Another is that periodic activations learn high frequency representations that are more expressive, allowing networks to quickly fit complex value functions. We analyse these claims empirically, finding that periodic representations consistently converge to high frequencies regardless of their initialisation frequency. We also find that while periodic activation functions improve sample efficiency, they exhibit worse generalization on states with added observation noise, especially when compared to otherwise equivalent networks with ReLU activation functions. Finally, we show that weight decay regularization is able to partially offset the overfitting of periodic activation functions, delivering value functions that learn quickly while also generalizing.
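
The link between weight magnitude and frequency that both hypotheses rely on can be illustrated with a small sketch. It assumes the simplest form of a periodic layer, a fully connected layer followed by $\sin$, with a single input and a single unit; the function names, the bias value and the two weight values are illustrative choices, not taken from the paper.

```python
import numpy as np

def fourier_layer(x, W, b):
    """Fully connected layer followed by a sin activation (illustrative form)."""
    return np.sin(x @ W + b)

def count_sign_changes(y):
    """Count zero crossings of a sampled signal as a rough proxy for frequency."""
    s = np.sign(np.ravel(y))
    s = s[s != 0]  # ignore samples that are exactly zero
    return int(np.sum(s[1:] != s[:-1]))

# Scalar inputs on [0, 1], shaped (num_samples, 1) for the matrix product.
x = np.linspace(0.0, 1.0, 2001).reshape(-1, 1)
b = np.array([0.3])  # hypothetical phase offset, keeps zeros off the sample grid

W_small = np.array([[2.0 * np.pi * 1.0]])  # one cycle over the input range
W_large = np.array([[2.0 * np.pi * 5.0]])  # five cycles over the input range

low = fourier_layer(x, W_small, b)
high = fourier_layer(x, W_large, b)

low_crossings = count_sign_changes(low)
high_crossings = count_sign_changes(high)
print(low_crossings, high_crossings)
```

Scaling the weight by five multiplies the number of oscillations over the same input range by five, which is why the learned weight scale can be read as the frequency of the representation.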

Paper Structure

This paper contains 26 sections, 5 equations, 14 figures, and 10 tables.

Figures (14)

  • Figure 1: (a) Learned Fourier feature layers are fully-connected layers with a $\sin$ activation. Larger weight magnitudes lead to faster oscillations as shown by the orange line, while smaller weights lead to slower oscillations as shown by the blue lines. (b) Results of a supervised learning warm-up experiment using high frequency Fourier features as well as ReLU features to fit the dynamic programming ground truth of the mountain car value function (Moore, 1990). Top left shows a segment of the ground truth value function from the mountain car environment. Top right shows the training/test distributions used by the ReLU and high frequency Fourier feature architectures: light is the training set, dark is the test set. The ReLU network smoothly fits the training regions and generalizes well to the test regions (bottom left). High frequency Fourier features fit the training distribution more precisely but do not generalize to the test regions (bottom right). See the supplementary material on the supervised learning warm-up for details.
  • Figure 2: Regardless of initialization and architecture (LFF vs CLFF), Fourier features converge to similar frequencies. (a) shows return early in training with different initial $\beta$'s, demonstrating that periodic activation functions improve performance at a range of initialization frequencies. (b) shows that the final scale $\beta$ of learned Fourier features (i.e. the frequency learned) is similar regardless of the initialization frequency. (c) shows the distribution of cycles (which is proportional to $\beta$) for different initial $\beta$'s. Error bars represent standard deviation over five seeds in panels (a) and (b).
  • Figure 3: Learned Fourier features perform as well as or worse than ReLU features when input observations are perturbed with medium noise. We apply three different levels of noise to observations at test time. In the medium noise case, architectures with learned Fourier feature activations generally perform as well as, and sometimes worse than (as in walker-run), ReLU architectures. Results are reported across ten seeds; the shaded region indicates standard deviation. Evaluation at different noise levels is performed at 100k increments throughout training.
  • Figure 4: Periodic activations are more brittle than ReLUs. (a) shows that $\sin$ pre- and post-activations become less similar over time more quickly than those of ReLUs. (b) shows a similar trend in the cosine similarity of activations before and after a medium level of noise is applied ($\sigma=0.625$). Qualitatively, LFF activations have their points shifted more dramatically than ReLU activations when inputs are perturbed (panel (c)). Results are reported across ten seeds. The shaded regions represent the standard deviation.
  • Figure 5: Weight decay is able to partially offset overfitting at a medium level of observation noise, as introduced in the section on fixing generalization, in some environments. In quadruped-run, weight decay allows learned Fourier features to generalize better than ReLU representations without weight decay, but does not improve upon ReLU in walker-run and hopper-hop. Results are reported across ten seeds. The shaded regions represent the standard deviation. Evaluation at different noise levels is performed at 100k increments throughout training.
  • ...and 9 more figures
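
The brittleness measurement described for Figure 4(b) can be sketched as follows: compute a layer's activations on clean observations and on observations with added Gaussian noise, then take the mean cosine similarity between the two. This is only an illustration under stated assumptions. The layer is a single randomly initialised dense layer rather than a trained SAC network, and the dimensions, weight scale and helper names are hypothetical; only the noise level $\sigma=0.625$ comes from the caption.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two batches of activation vectors."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def relu(z):
    return np.maximum(z, 0.0)

def activation_similarity(act, obs, W, b, sigma, rng):
    """Mean cosine similarity of activations before and after observation noise."""
    noisy_obs = obs + sigma * rng.standard_normal(obs.shape)
    clean = act(obs @ W + b)
    noisy = act(noisy_obs @ W + b)
    return float(np.mean(cosine_similarity(clean, noisy)))

rng = np.random.default_rng(0)
obs_dim, hidden_dim, batch = 17, 64, 256       # hypothetical sizes
obs = rng.standard_normal((batch, obs_dim))    # stand-in for environment states
W = rng.standard_normal((obs_dim, hidden_dim)) # untrained weights, illustration only
b = np.zeros(hidden_dim)

sigma = 0.625  # the medium noise level quoted in the Figure 4 caption
sin_sim = activation_similarity(np.sin, obs, W, b, sigma, rng)
relu_sim = activation_similarity(relu, obs, W, b, sigma, rng)
print(f"sin: {sin_sim:.3f}  relu: {relu_sim:.3f}")
```

With random weights the printed numbers say nothing about the paper's results; the sketch only shows how the similarity is computed. Reproducing the reported trend would require the trained critic weights at each evaluation checkpoint.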

Theorems & Definitions (1)

  • Definition 1