Z-Splat: Z-Axis Gaussian Splatting for Camera-Sonar Fusion

Ziyuan Qu, Omkar Vengurlekar, Mohamad Qadri, Kevin Zhang, Michael Kaess, Christopher Metzler, Suren Jayasuriya, Adithya Pediredla

TL;DR

This work tackles the missing-cone problem in differentiable Gaussian splatting by introducing Z-Axis Gaussian Splatting that fuses RGB camera data with time-resolved sonar measurements (echosounder and FLS). The method extends the Gaussian splatting framework to the z-axis, modeling sonar transients as 1D or 2D histograms and integrating them with camera renders via a joint optimization. Across simulations, hardware emulation, and real-world experiments, the fusion approach yields up to ~5 dB improvements in novel-view PSNR and around 60% reductions in Chamfer distance for 3D geometry, with FLS offering consistent gains over echosounder. The results demonstrate significant practical benefits for accurate 3D reconstruction and view synthesis in challenging, small-baseline scenarios, including underwater and cluttered environments.

Abstract

Differentiable 3D-Gaussian splatting (GS) is emerging as a prominent technique in computer vision and graphics for reconstructing 3D scenes. GS represents a scene as a set of 3D Gaussians with varying opacities and employs a computationally efficient splatting operation along with analytical derivatives to compute the 3D Gaussian parameters given scene images captured from various viewpoints. Unfortunately, capturing surround view ($360^{\circ}$ viewpoint) images is impossible or impractical in many real-world imaging scenarios, including underwater imaging, rooms inside a building, and autonomous navigation. In these restricted baseline imaging scenarios, the GS algorithm suffers from a well-known 'missing cone' problem, which results in poor reconstruction along the depth axis. In this manuscript, we demonstrate that using transient data (from sonars) allows us to address the missing cone problem by sampling high-frequency data along the depth axis. We extend the Gaussian splatting algorithms for two commonly used sonars and propose fusion algorithms that simultaneously utilize RGB camera data and sonar data. Through simulations, emulations, and hardware experiments across various imaging scenarios, we show that the proposed fusion algorithms lead to significantly better novel view synthesis (5 dB improvement in PSNR) and 3D geometry reconstruction (60% lower Chamfer distance).
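To put the two headline numbers in context, the sketch below implements the standard PSNR definition and one common form of the Chamfer distance. It is illustrative only and is not the authors' evaluation code: the function names `psnr` and `chamfer_distance` are hypothetical, images are assumed to be scaled to [0, 1], and the Chamfer variant (mean nearest-neighbour Euclidean distance, summed over both directions) is an assumption, since the abstract does not say which variant is used.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB (assumes pixel values in [0, peak])."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two (N, 3) and (M, 3) point clouds.

    Assumed variant: mean nearest-neighbour Euclidean distance from A to B
    plus the same from B to A. Brute force, so only suitable for small clouds.
    """
    a = np.asarray(points_a, dtype=np.float64)
    b = np.asarray(points_b, dtype=np.float64)
    # Pairwise distance matrix of shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because PSNR is logarithmic in the mean squared error, a 5 dB gain corresponds to reducing the MSE by a factor of $10^{0.5} \approx 3.16$.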

Paper Structure (18 sections, 14 equations, 12 figures, 7 tables, 2 algorithms)

This paper contains 18 sections, 14 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: Sonar measurements provide complementary information. (a) Volumetric scene captured with three pairs of cameras and sonars (echosounders). We assume the sensors are in the far field (i.e., the affine approximation to the projective transform used in Gaussian splatting research is valid). For the center camera-sonar pair, camera measurements are obtained by projecting the volumetric data along the vertical axis, and sonar measurements are obtained by projecting the volumetric data along the horizontal axis. (b) If only camera measurements are considered, then by the Fourier-slice theorem we capture only a few slices of the Fourier transform of the volume and miss the information in a large cone. (c) Sonar (time-resolved data) captures orthogonal slices in Fourier space; hence, 3D reconstruction of the scene is better conditioned with camera-sonar fusion than with camera data alone.
  • Figure 2: Ray View Transformation and Z-Axis Splatting. (a) This illustration shows the camera view. The covariance of Gaussians in the camera view is $\varSigma = W^T \varSigma W$, which transforms the Gaussians from the world view to the camera view. (b) The Gaussians are transformed into the ray view through a local affine approximation of the projection transform using the Jacobian ($J$). The covariance matrix of the Gaussians becomes $\varSigma' = J^T \varSigma J$. (c) The transformed 3D Gaussian is then projected (splatted) onto the $xy$-plane to render the camera image and onto the $z$-axis to render the echosounder measurement (for a collocated camera and echosounder). The gray Gaussian is occluded by the Gaussian in front of it, so its transmission ($T$) is smaller than that of the others, regardless of whether we are rendering the camera or the sonar. In the echosounder and FLS rendering algorithms, each ray is splatted independently, so a Gaussian rasterized by multiple rays is splatted multiple times.
  • Figure 3: Simulation and emulation training for both echosounder and FLS fusion techniques. (a) Raw depth image captured with a Time-of-Flight (ToF) camera. (b) An RGB image captured with a camera. (c) Simulated echosounder intensity, generated from the depth histogram and used as ground truth during training. (d) A 3D Gaussian scene. We use $xy$-splatting to render RGB images and $z$-splatting to render the echosounder depth intensity distribution. (e) Simulated FLS intensity, generated by histogramming depth per row. (f) A 3D Gaussian scene. We splat along the $xy$-direction to render the RGB image and along the $yz$-direction to render the FLS image. We minimize the sum of the RGB loss and the corresponding depth loss to train the camera-sonar fusion algorithms.
  • Figure 4: Novel view synthesis comparison. The incorporation of depth information notably mitigates the presence of floaters in the reconstructed scene. Moreover, depth information accurately positions the Gaussian kernels, particularly in scenes with uniform color or overexposure. The average SSIM, PSNR, and LPIPS metrics for the entire test set of 263 novel views are reported in the corresponding photometric results table.
  • Figure 5: Geometry comparison on one-object scenes. We captured the data by moving the camera only along the $x$-axis. We show ground truth meshes and superimpose the reconstructed Gaussians as point clouds. In the highlighted regions, we can observe that camera-only methods reconstruct the geometry inaccurately along the $z$-axis, whereas the proposed fusion techniques reconstruct the geometry accurately.
  • ...and 7 more figures
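
The Figure 3 caption describes simulating sonar ground truth from a ToF depth image: a single depth histogram for the echosounder and one depth histogram per image row for the FLS. The sketch below is a minimal reading of that description, not the authors' pipeline. The function names, the `num_bins` and `max_range` parameters, the convention that a depth of zero marks an invalid pixel, and the choice to count every pixel as one unit of return (no albedo, range attenuation, or beam pattern) are all assumptions made for illustration.

```python
import numpy as np

def simulate_echosounder(depth, num_bins, max_range):
    """1D transient: histogram of all valid depths (hypothetical helper).

    Assumes depth == 0 marks an invalid pixel and that each valid pixel
    contributes one unit of returned intensity.
    """
    depth = np.asarray(depth, dtype=np.float64)
    valid = depth[(depth > 0) & (depth <= max_range)]
    hist, _ = np.histogram(valid, bins=num_bins, range=(0.0, max_range))
    return hist.astype(np.float64)

def simulate_fls(depth, num_bins, max_range):
    """2D transient: one depth histogram per image row, shape (rows, num_bins)."""
    depth = np.asarray(depth, dtype=np.float64)
    return np.stack(
        [simulate_echosounder(row, num_bins, max_range) for row in depth]
    )

# Tiny 2x3 depth image in metres; the 0.0 entry is treated as invalid.
depth_image = np.array([[1.0, 1.0, 3.0],
                        [3.0, 3.0, 0.0]])
echo = simulate_echosounder(depth_image, num_bins=4, max_range=4.0)
fls = simulate_fls(depth_image, num_bins=4, max_range=4.0)
```

Under these assumptions, summing the FLS image over its rows recovers the echosounder histogram, which reflects that the FLS keeps one spatial axis resolved in addition to range while the echosounder resolves range only.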