$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation

Yinshuang Xu, Dian Chen, Katherine Liu, Sergey Zakharov, Rares Ambrus, Kostas Daniilidis, Vitor Guizilini

TL;DR

This paper proposes to embed equivariant multi-view learning into the Perceiver IO architecture, employing Spherical Harmonics for positional encoding to ensure 3D rotation equivariance, and develops a specialized equivariant encoder and decoder within the Perceiver IO architecture.

Abstract

Incorporating inductive bias by embedding geometric entities (such as rays) as input has proven successful in multi-view learning. However, the methods adopting this technique typically lack equivariance, which is crucial for effective 3D learning. Equivariance serves as a valuable inductive prior, aiding in the generation of robust multi-view features for 3D scene understanding. In this paper, we explore the application of equivariant multi-view learning to depth estimation, not only recognizing its significance for computer vision and robotics but also addressing the limitations of previous research. Most prior studies have either overlooked equivariance in this setting or achieved only approximate equivariance through data augmentation, which often leads to inconsistencies across different reference frames. To address this issue, we propose to embed $SE(3)$ equivariance into the Perceiver IO architecture. We employ Spherical Harmonics for positional encoding to ensure 3D rotation equivariance, and develop a specialized equivariant encoder and decoder within the Perceiver IO architecture. To validate our model, we apply it to the task of stereo depth estimation, achieving state-of-the-art results on real-world datasets without explicit geometric constraints or extensive data augmentation.
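The claim that spherical harmonics give a rotation-equivariant positional encoding can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes real spherical harmonics up to degree 2 in one common normalization, and the helper names (`real_sh`, `random_rotation`) are hypothetical. It shows that the degree-1 embedding of a rotated ray equals a fixed orthogonal matrix applied to the embedding of the original ray, and that inner products between the embeddings of two rays do not change when both rays are rotated together.

```python
import numpy as np

def real_sh(d):
    """Real spherical harmonics of degrees 0, 1 and 2 for a unit vector d."""
    x, y, z = d
    l0 = np.array([0.5 / np.sqrt(np.pi)])
    l1 = np.sqrt(3.0 / (4.0 * np.pi)) * np.array([y, z, x])
    l2 = np.array([
        0.5 * np.sqrt(15.0 / np.pi) * x * y,
        0.5 * np.sqrt(15.0 / np.pi) * y * z,
        0.25 * np.sqrt(5.0 / np.pi) * (3.0 * z * z - 1.0),
        0.5 * np.sqrt(15.0 / np.pi) * x * z,
        0.25 * np.sqrt(15.0 / np.pi) * (x * x - y * y),
    ])
    return l0, l1, l2

def random_rotation(rng):
    """Random proper rotation (determinant +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    return q

rng = np.random.default_rng(0)
R = random_rotation(rng)
a = rng.normal(size=3)
a /= np.linalg.norm(a)
b = rng.normal(size=3)
b /= np.linalg.norm(b)

# Degree 1: the embedding is a permuted copy of the direction, so rotating the
# ray by R acts on the embedding through the orthogonal matrix P R P^T.
P = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
D1 = P @ R @ P.T
l1_equivariant = np.allclose(real_sh(R @ a)[1], D1 @ real_sh(a)[1])

# Every degree: inner products of embeddings are unchanged by a shared rotation.
inner_invariant = all(
    np.allclose(u @ v, ur @ vr)
    for u, v, ur, vr in zip(real_sh(a), real_sh(b), real_sh(R @ a), real_sh(R @ b))
)

# Degree 2 addition theorem: the inner product depends only on the angle a.b.
t = a @ b
l2_addition = np.allclose(
    real_sh(a)[2] @ real_sh(b)[2],
    5.0 / (4.0 * np.pi) * 0.5 * (3.0 * t * t - 1.0),
)

print(l1_equivariant, inner_invariant, l2_addition)
```

A Fourier encoding of raw ray coordinates has no such fixed linear action under rotation, which is one way to see why a conventional positional encoding does not yield equivariant features.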

Paper Structure

This paper contains 48 sections, 28 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Given a sparse set of posed images (red), the task is to estimate depth for a novel viewpoint (blue). Perceiver IO struggles to accurately predict depth when the reference frame (gray) changes, which is equivalent to applying the inverse transformation to the object and cameras. In contrast, our model delivers a consistent result due to its equivariant design.
  • Figure 2: Our proposed Equivariant Perceiver IO (EPIO) architecture. (a) We take as input the concatenation of per-pixel image, ray, and camera embeddings, the latter two calculated using spherical harmonics. (b) The output of our equivariant encoder is a global latent code, including both global invariant and equivariant components. From those, we extract an equivariant reference frame through an equivariant MLP, while simultaneously obtaining invariant latents through inner product. (c) When a query camera is positioned in this equivariant reference frame, its pose becomes invariant, which enables the use of conventional Fourier basis to encode it. (d) Given an invariant latent and invariant pose, we use a conventional Perceiver IO decoder to generate predictions for each query ray.
  • Figure 3: Comparison between an equivariant input embedding in our model (left) and the conventional input embedding in DeFiNe (right). (a) Pipeline used to generate input embeddings for the encoder, resulting in cross-attention keys and values. (b) To generate geometric information, we calculate embeddings for each ray $r^i_{uv}$ and camera relative position $t_i -\bar{t}$; (c) The final composed embedding format includes both image embeddings, which are invariant, and geometric embeddings, which are equivariant. In contrast, the conventional approach by Perceiver IO, as highlighted in parts (a) and (c), integrates Fourier positional encodings with image embeddings to form the input embeddings. Furthermore, as indicated in (b), Perceiver IO utilizes each ray $r^i_{uv}$ and the absolute translation $t_i$ for positional encoding purposes.
  • Figure 4: Left: Our equivariant module differs from traditional attention implementations (Vaswani et al., 2017) in its fundamental layers and its key-query product, which are crafted to be equivariant and invariant, respectively. Right: Equivariant latent array used as additional input to the encoder. We apply equivariant positional encoding to each camera rotation, which is then averaged. We leverage an equivariant linear layer to get a global geometric latent $\oplus_lG_l$, which is concatenated with the conventional latent array $\mathcal{R}_0$ to compose our proposed equivariant latent array $\mathcal{R}'_0$.
  • Figure 5: Equivariant latent code and predicted frame. For simplicity, we use object rotation to denote the inverse rotation of the reference frame. When the object is rotated, our latent code and predicted canonical frame are also rotated.
  • ...and 9 more figures
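
The captions of Figures 2 (b)-(c) and 3 (b) describe two ideas that a short sketch can make concrete: camera positions are taken relative to their mean $\bar{t}$, and a query pose expressed in an equivariant reference frame becomes invariant. The code below is an illustration under stated assumptions, not the paper's method: the paper predicts the frame with an equivariant MLP from the latent code, whereas here a Gram-Schmidt frame built from two centered camera positions serves as a hypothetical stand-in, and all function names are invented for the example.

```python
import numpy as np

def gram_schmidt_frame(v1, v2):
    """Orthonormal right-handed frame (axes as columns) from two vectors."""
    e1 = v1 / np.linalg.norm(v1)
    u2 = v2 - (v2 @ e1) * e1
    e2 = u2 / np.linalg.norm(u2)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=1)

def canonicalize_query(cam_t, R_q, t_q):
    """Express a query pose (R_q, t_q) in a frame derived from the input cameras.

    Stand-in assumption: the frame comes from the first two centered camera
    positions rather than from a learned equivariant MLP.
    """
    center = cam_t.mean(axis=0)   # mean camera position, t-bar in Figure 3
    c = cam_t - center            # centering removes any global translation
    F = gram_schmidt_frame(c[0], c[1])
    return F.T @ R_q, F.T @ (t_q - center)

def random_rotation(rng):
    """Random proper rotation (determinant +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(1)
cam_t = rng.normal(size=(3, 3))   # three input camera centers, one per row
R_q = random_rotation(rng)        # query camera rotation
t_q = rng.normal(size=3)          # query camera center

# Change of reference frame: the same rigid motion g = (R, s) acts on everything.
R, s = random_rotation(rng), rng.normal(size=3)
cam_t_g = cam_t @ R.T + s
R_q_g = R @ R_q
t_q_g = R @ t_q + s

rot_before, pos_before = canonicalize_query(cam_t, R_q, t_q)
rot_after, pos_after = canonicalize_query(cam_t_g, R_q_g, t_q_g)

pose_invariant = np.allclose(rot_before, rot_after) and np.allclose(pos_before, pos_after)
print(pose_invariant)
```

Because the canonicalized pose no longer depends on the choice of reference frame, it can be encoded with a conventional Fourier basis without breaking consistency, which matches the role described in Figure 2 (c).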