Neural Fourier Transform: A General Approach to Equivariant Representation Learning

Masanori Koyama, Kenji Fukumizu, Kohei Hayashi, Takeru Miyato

TL;DR

NFT provides a general framework for equivariant representation learning by learning a latent linear action of a group from data without explicit action knowledge. It connects Fourier analysis to nonlinear settings via invariant kernels and RKHS, proving existence and identifiability and presenting three NFT modes (U-NFT, G-NFT, g-NFT). Empirically, NFT recovers major symmetry modes in nonlinear deformations of signals and demonstrates strong OOD generalization and novel-view capabilities on image datasets, often outperforming standard DFT and some steerable baselines in challenging settings. By enabling data-dependent spectral decomposition and the incorporation of prior symmetry structure, NFT offers a flexible, theoretically grounded approach to symmetry-aware learning with broad applicability and several open questions for optimization guarantees and scalability.

Abstract

Symmetry learning has proven to be an effective approach for extracting the hidden structure of data, with the concept of equivariance relation playing the central role. However, most of the current studies are built on architectural theory and corresponding assumptions on the form of data. We propose Neural Fourier Transform (NFT), a general framework of learning the latent linear action of the group without assuming explicit knowledge of how the group acts on data. We present the theoretical foundations of NFT and show that the existence of a linear equivariant feature, which has been assumed ubiquitously in equivariance learning, is equivalent to the existence of a group invariant kernel on the dataspace. We also provide experimental results to demonstrate the application of NFT in typical scenarios with varying levels of knowledge about the acting group.
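For orientation, the classical discrete Fourier transform is a standard instance of the kind of linear equivariant feature the abstract describes. This is a textbook fact rather than an example taken from the paper, and the symbols $\Phi$ (feature map) and $M(g)$ (latent linear action) are used here only as illustrative notation. Take signals $x \in \mathbb{C}^N$ and let the cyclic group act by shifts, $(g \circ x)[n] = x[(n-g) \bmod N]$. Writing $\hat{x}[k] = \sum_{n=0}^{N-1} x[n]\, e^{-2\pi i k n / N}$ and substituting $m = n - g$:

$$
\widehat{g \circ x}[k]
= \sum_{n=0}^{N-1} x[(n-g) \bmod N]\, e^{-2\pi i k n / N}
= e^{-2\pi i k g / N} \sum_{m=0}^{N-1} x[m]\, e^{-2\pi i k m / N}
= e^{-2\pi i k g / N}\, \hat{x}[k].
$$

So with $\Phi(x) = \hat{x}$ the shift acts linearly in the latent space, $\Phi(g \circ x) = M(g)\Phi(x)$ with $M(g) = \mathrm{diag}\big(e^{-2\pi i k g / N}\big)_{k=0}^{N-1}$, and each diagonal entry corresponds to one frequency. In the DFT case both $\Phi$ and $M$ are fixed in advance; NFT, as described above, aims to learn such a latent linear action from data.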

Paper Structure (33 sections, 15 theorems, 62 equations, 22 figures, 1 table)

This paper contains 33 sections, 15 theorems, 62 equations, 22 figures, 1 table.

Key Result

Lemma 3.1

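If $\textrm{span}\{ \Phi(\mathcal{X})\}$ equals the entire latent space, then the equivariance relation $\Phi(g \circ x) = M(g)\Phi(x)$ (eq:NFT in the paper) implies that $M(g)$ is a group representation, that is, $M(e)=\mathrm{Id}$ and $M(gh)=M(g)M(h)$ for all group elements $g, h$.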
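The lemma can be checked numerically in a toy setting. The sketch below is not the authors' training procedure: it assumes the cyclic shift group $\mathbb{Z}_N$ acting on length-$N$ signals, fixes the encoder $\Phi$ to the DFT instead of learning it, and estimates each $M(g)$ by least squares from the relation $\Phi(g \circ x) = M(g)\Phi(x)$. The names `act`, `phi`, and `estimate_M` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

N = 8  # order of the cyclic group Z_N (illustrative choice)
rng = np.random.default_rng(0)
X = rng.normal(size=(32, N))  # 32 random real signals of length N

def act(g, x):
    """Group action: cyclic shift of each signal by g positions."""
    return np.roll(x, g, axis=-1)

def phi(x):
    """Stand-in encoder (assumption): the DFT, with latent space C^N."""
    return np.fft.fft(x, axis=-1)

def estimate_M(g, x):
    """Least-squares solution M of phi(g.x) = M phi(x) over all samples."""
    A, B = phi(x), phi(act(g, x))               # rows are latent vectors
    Mt, *_ = np.linalg.lstsq(A, B, rcond=None)  # solves A @ M^T = B
    return Mt.T

# Span condition of the lemma: phi(X) must span the whole latent space.
assert np.linalg.matrix_rank(phi(X)) == N

M = {g: estimate_M(g, X) for g in range(N)}

# Representation properties: M(e) = Id and M(g) M(h) = M(gh).
identity_ok = np.allclose(M[0], np.eye(N))
product_ok = all(
    np.allclose(M[g] @ M[h], M[(g + h) % N])
    for g in range(N) for h in range(N)
)
print(identity_ok, product_ok)
```

The rank check mirrors the span hypothesis. If the sampled latent vectors spanned only a proper subspace, the least-squares problem would not pin down $M(g)$ uniquely, and the representation properties would be guaranteed only on that subspace.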

Figures (22)

  • Figure 1: Left: An image sequence produced by applying fisheye transformation after horizontal shifting. Right: 2D renderings of a spinning chair.
  • Figure 2: NFT framework. Each block corresponds to irreducible representation/frequency.
  • Figure 3: DFT result
  • Figure 4: Left: a sequence of length-$128$ signals constructed by applying the shift operation at constant speed. Right: the sequence of the same function under time deformation.
  • Figure 5: Left: Long-horizon future prediction of the sequence of time-warped signals. Center: $\mathbb{E}[\langle \rho_f | B \rangle]$ plotted against $f \in [0, 64]$ for each block $B$ in the block-diagonalized $M_*$s learned from the dataset with 5 major frequencies ($\{8, 15, 22, 40, 45\}$) and 2 noise frequencies with small coefficients ($\{18, 43\}$), when the $M_*$s can express at most 5 frequencies. Note that $\mathbb{E}[\langle \rho_f | M_* \rangle]$ is linear with respect to $M_*$ (Appendix \ref{sec:char}). Right: Average absolute value of the block-diagonalized $M_*$s.
  • ...and 17 more figures

Theorems & Definitions (29)

  • Lemma 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem A.1: Schur's lemma
  • Lemma B.1
  • proof
  • Proposition C.1
  • proof
  • Corollary C.2
  • proof
  • ...and 19 more