Neural Isometries: Taming Transformations for Equivariant ML

Thomas W. Mitchel, Michael Taylor, Vincent Sitzmann

TL;DR

This paper introduces Neural Isometries, an autoencoder framework that learns to map the observation space to a general-purpose latent space in which encodings are related by isometries whenever their corresponding observations are geometrically related in world space.

Abstract

Real-world geometry and 3D vision tasks are replete with challenging symmetries that defy tractable analytical expression. In this paper, we introduce Neural Isometries, an autoencoder framework which learns to map the observation space to a general-purpose latent space wherein encodings are related by isometries whenever their corresponding observations are geometrically related in world space. Specifically, we regularize the latent space such that maps between encodings preserve a learned inner product and commute with a learned functional operator, in the same manner as rigid-body transformations commute with the Laplacian. This approach forms an effective backbone for self-supervised representation learning, and we demonstrate that a simple off-the-shelf equivariant network operating in the pre-trained latent space can achieve results on par with meticulously-engineered, handcrafted networks designed to handle complex, nonlinear symmetries. Furthermore, isometric maps capture information about the respective transformations in world space, and we show that this allows us to regress camera poses directly from the coefficients of the maps between encodings of adjacent views of a scene.
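
The two conditions named in the abstract (maps that preserve an inner product and commute with an operator) can be illustrated on a toy case. The sketch below is not the authors' implementation: it assumes a fixed, hand-built operator (the discrete Laplacian on a periodic 1D grid) in place of the learned one, an identity mass matrix, and a cyclic shift as the "rigid" transformation. All variable names are illustrative.

```python
import numpy as np

n = 16
identity = np.eye(n)

# Assumption: a known operator stands in for the learned one.
# This is the discrete Laplacian on a periodic 1D grid (a circle).
laplacian = 2 * identity - np.roll(identity, 1, axis=0) - np.roll(identity, -1, axis=0)

# A cyclic shift by 3 samples, written as a permutation matrix.
shift = np.roll(identity, 3, axis=0)

# The shift commutes with the Laplacian, as rigid motions do in the continuous case.
commutator_norm = np.linalg.norm(shift @ laplacian - laplacian @ shift)

# Spectral decomposition of the operator (mass matrix assumed to be the identity).
eigenvalues, eigenfunctions = np.linalg.eigh(laplacian)

# The transformation expressed in the operator eigenbasis: a functional map.
tau = eigenfunctions.T @ shift @ eigenfunctions

# Condition 1: tau preserves the inner product (it is orthogonal here).
isometry_error = np.linalg.norm(tau.T @ tau - identity)

# Condition 2: tau commutes with the diagonal matrix of eigenvalues.
spectrum = np.diag(eigenvalues)
commutation_error = np.linalg.norm(tau @ spectrum - spectrum @ tau)

# Consequence: tau only mixes eigenfunctions that share an eigenvalue,
# so entries linking distinct eigenvalues vanish (block-diagonal structure).
distinct = np.abs(eigenvalues[:, None] - eigenvalues[None, :]) > 1e-8
max_off_block = np.abs(tau[distinct]).max()

print(commutator_norm, isometry_error, commutation_error, max_off_block)
```

All four printed quantities should be zero up to floating-point error. In the framework described above the operator and inner product are learned rather than fixed, so this only shows what the regularized conditions mean once such an operator exists.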

Paper Structure (50 sections, 29 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Neural Isometries find latent spaces where complex transformations become tractable.
  • Figure 2: Overview of Neural Isometries (NIso). NIso learn a latent space where transformations of observations manifest as isometries, achieved by regularizing the functional maps $\tau$ between latents to commute with a learned operator $\Omega$, parameterized via its spectral decomposition into a mass matrix $\mathbb{M}$, eigenfunctions $\Phi$, and eigenvalues $\Lambda$ (sec. \ref{sec: iso_reg}). Two observations $\psi$ and $T\psi$ related by some unknown transformation $T$ (in this case, camera motion in a 3D scene) are first encoded into latent functions $\mathcal{E}(\psi)$ and $\mathcal{E}(T\psi)$ and projected into the operator eigenbasis. An isometric functional map $\tau_{\Omega}$ is estimated between them and used to map one to the other. Losses promote isometry-equivariance in the latent space, reconstruction of transformed latents, and distinct, low-multiplicity eigenvalues $\Lambda$, with the latter encouraging $\tau_{\Omega}$ to be as diagonal as possible. An optional spectral dropout layer can be applied before the basis unprojection to encourage a physically meaningful ordering of the learned spectrum (sec. \ref{sec:training}).
  • Figure 3: Approximating the Laplacian. Forced to map between shifted images on the torus (first row, left) and rotated images on the sphere (second row, left), NIso regress operators (center right) structurally similar to the toric and spherical Laplacians (right). Maps $\tau_{\Omega}$ between projected images are strongly diagonal (center left), with individual blocks (inset) preserving the subspaces spanned by eigenfunctions (center, first $64$ shown) sharing nearly the same eigenvalues. These experiments result in the discovery of bases with properties similar to those of the toric and spherical harmonics. In particular, the estimated spherical $\tau_{\Omega}$ manifests exactly the same structure as the ground truth Wigner-D matrices corresponding to the rotation, with square blocks of size $(2\ell + 1) \times (2\ell + 1)$ for the $\ell$-th distinct eigenvalue. Please zoom in to view structural details.
  • Figure 4: Visualizing the Learned Eigenfunctions $\Phi$ and Mass Matrices $\mathbb{M}$. Visualizations of the eigenfunctions $\Phi$ learned in each experiment are shown on the top row. Eigenfunctions are sorted by eigenvalue in ascending order along rows in C-style indexing. Here, experiments were performed without spectral dropout, so the ordering is random. The elements of the learned diagonal mass matrices $\mathbb{M}$ are shown on the bottom row, in terms of the magnitude of the deviation from the mean value at each grid index in the latent space. White indicates little deviation from the mean, with green-blue indicating mass values above the mean and orange-red indicating mass values below. In the MNIST experiments (sec. \ref{sec:hmnist_exp}), the distribution of mass appears to segment null space from the central region most often occupied by the digits. In the conformal shape classification experiments (sec. \ref{sec: conf_class}), the larger deviations from the mean appear closer to the poles (the top-most and bottom-most rows of the spherical grid). For the pose estimation experiments (sec. \ref{sec:pose_est}), larger deviations appear at the boundaries, with the lower half of the grid having slightly higher values.
  • Figure 5: Qualitative Pose Estimation Comparisons. Example predicted trajectories for each method on select CO3Dv2 evaluation sequences. Ground truth is shown in black, NIso in red, the NFT in blue, and the transformer baseline in green. NIso appears to consistently better capture rotational (curvature) information about the world space transformations, helping it to better track the camera motion across scales.
  • ...and 3 more figures
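
The block structure described in the Figure 3 caption can be made concrete with a small sketch. It relies on a standard fact about the spherical Laplacian: the eigenvalue of degree $\ell$ has multiplicity $2\ell + 1$, so the first $64$ eigenfunctions cover degrees $\ell = 0, \dots, 7$. The helper names `wigner_block_sizes` and `block_diagonal_mask` are hypothetical and introduced only for illustration; they are not taken from the paper's code.

```python
import numpy as np

def wigner_block_sizes(num_eigenfunctions):
    """Sizes (2*l + 1) of the complete blocks that fit in the first num_eigenfunctions."""
    sizes = []
    degree = 0
    used = 0
    while used + 2 * degree + 1 <= num_eigenfunctions:
        sizes.append(2 * degree + 1)
        used += 2 * degree + 1
        degree += 1
    return sizes

def block_diagonal_mask(sizes):
    """Boolean mask that is True inside each square diagonal block."""
    total = sum(sizes)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for size in sizes:
        mask[start:start + size, start:start + size] = True
        start += size
    return mask

sizes = wigner_block_sizes(64)
mask = block_diagonal_mask(sizes)
print(sizes)       # block sizes for degrees 0..7
print(mask.shape)  # the mask covers all 64 eigenfunctions
```

A mask like this could be compared against the magnitude of an estimated map to check how closely it follows the expected pattern, assuming the learned eigenfunctions are ordered by eigenvalue.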