SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision

Avigail Cohen Rimon, Amir Mann, Mirela Ben Chen, Or Litany

Abstract

3D Gaussian Splatting (3DGS) enables real-time, photorealistic novel view synthesis, making it a highly attractive representation for model-based video tracking. However, leveraging the differentiability of the 3DGS renderer "in the wild" remains notoriously fragile. A fundamental bottleneck lies in the compact, local support of the Gaussian primitives. Standard photometric objectives implicitly rely on spatial overlap; if severe camera misalignment places the rendered object outside the target's local footprint, gradients strictly vanish, leaving the optimizer stranded. We introduce SpectralSplats, a robust tracking framework that resolves this "vanishing gradient" problem by shifting the optimization objective from the spatial to the frequency domain. By supervising the rendered image via a set of global complex sinusoidal features (Spectral Moments), we construct a global basin of attraction, ensuring that a valid, directional gradient toward the target exists across the entire image domain, even when pixel overlap is completely nonexistent. To harness this global basin without introducing periodic local minima associated with high frequencies, we derive a principled Frequency Annealing schedule from first principles, gracefully transitioning the optimizer from global convexity to precise spatial alignment. We demonstrate that SpectralSplats acts as a seamless, drop-in replacement for spatial losses across diverse deformation parameterizations (from MLPs to sparse control points), successfully recovering complex deformations even from severely misaligned initializations where standard appearance-based tracking catastrophically fails.
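
The overlap argument above can be made concrete with a one-dimensional worked example. This is a sketch under stated assumptions, not the paper's exact formulation: the moment definition $M_k$, the period $L$, the band limit $K$, and the pure-translation model $I_\Theta(x) = T(x - \Delta)$ are all illustrative choices introduced here. Take complex sinusoidal moments of a signal on a periodic domain of length $L$ and penalize their mismatch:

$$M_k(I) = \int_0^L I(x)\, e^{-2\pi i k x / L}\, dx, \qquad \mathcal{L}_K(\Theta) = \sum_{k=1}^{K} \left| M_k(I_\Theta) - M_k(T) \right|^2 .$$

If the render is a translated copy of the target, the shift theorem gives $M_k(I_\Theta) = e^{-2\pi i k \Delta / L}\, M_k(T)$, so

$$\mathcal{L}_K(\Delta) = \sum_{k=1}^{K} 2\, |M_k(T)|^2 \left( 1 - \cos\frac{2\pi k \Delta}{L} \right), \qquad \frac{\partial \mathcal{L}_K}{\partial \Delta} = \sum_{k=1}^{K} \frac{4\pi k}{L}\, |M_k(T)|^2 \sin\frac{2\pi k \Delta}{L} .$$

In this toy model the gradient depends only on the displacement $\Delta$, not on whether the two signals overlap in space. Each term is monotone in $|\Delta|$ only for $|\Delta| < L/(2k)$ and repeats with period $L/k$, which is consistent with the abstract's point that low frequencies give a wide basin while high frequencies introduce periodic local minima.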

Paper Structure (38 sections, 24 equations, 14 figures, 6 tables)

Figures (14)

  • Figure 1: SpectralSplats enables robust tracking from zero-overlap initializations. Left: A 3DGS asset is initialized (see transparent overlay) far from the target pose image (solid image), resulting in strictly zero spatial overlap in the rendered camera view. Right: We compare the optimization progression. Standard photometric tracking (Pixel loss) implicitly requires spatial overlap; without it, directional gradients vanish, causing the optimizer to strand the asset and eventually collapse into spurious local minima. SpectralSplats (Ours) shifts supervision to the frequency domain via Spectral Moments. This establishes a global basin of attraction, allowing the Gaussians to smoothly flow across the image domain and successfully recover the extreme displacement.
  • Figure 2: Breaking the Locality Trap: A 1D Optimization Analysis. We simulate the optimization landscape (bottom) for aligning a rendered 1D Gaussian pulse (top, red) to a target (top, green) under a large initial spatial displacement ($\Theta_0 = 6$). Standard $\mathbf{L_2}$ (Col 1): Photometric objectives implicitly rely on spatial overlap; without it, the gradient strictly vanishes, leaving the optimizer stranded. No Annealing (Col 2): Projecting the loss onto a static, high-frequency spectral basis ($k=5$) ensures the gradient no longer vanishes globally, but introduces severe phase-wrapping that traps the optimizer in false local minima. Ours (Cols 3-6): Spectral Moment Supervision with Frequency Annealing. By restricting initial supervision to low frequencies, we construct a globally convex basin of attraction that provides a valid, directional gradient from any initialization. As the spatial error strictly decreases, our principled annealing schedule safely expands the active bandwidth, seamlessly transitioning the landscape to achieve high-frequency spatial precision without phase-wrapping.
  • Figure 3: A long low-frequency "warm-up" phase (right) leads to loss of high frequency details (tail), compared to a shorter "warm-up" phase (left).
  • Figure 4: (Left) Effect of initial spatial shift in GART, showing averaged PSNR, SSIM, and LPIPS versus shift radius; pixel-only supervision degrades rapidly, while our method remains stable. (Right) Corresponding results on SC4D [wu2024sc4d, jiang2024consistentd], reporting PSNR for training and novel views; pixel loss deteriorates under misalignment, whereas our method maintains stable performance.
  • Figure 5: Qualitative comparison on the SC4D data under initial spatial shift (radius = 0.5). For three characters and animations, we show the initial pose, GT, MLP+Ours and MLP+Pixel, both without LPIPS. While pixel-only optimization fails to recover correct pose and may drift the object outside the frame, our method achieves better alignment and more coherent structure in both training and novel views.
  • ...and 9 more figures
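
The 1D analysis described in the Figure 2 caption can be imitated with a few lines of NumPy. The sketch below is a toy reconstruction, not the authors' code: the domain length `L`, the pulse width `SIGMA`, the learning rate, and in particular the staircase schedule in `annealed_band` are hypothetical choices made for illustration (the paper derives its own annealing schedule, which this excerpt does not give). Only the initial displacement of 6 and the static frequency $k=5$ come from the caption.

```python
import numpy as np

# Toy 1D setup; every constant here is an illustrative assumption
# except THETA0 = 6 and the static frequency k = 5 (from the Figure 2 caption).
L = 24.0                                    # period of the 1D "image" domain
x = np.linspace(-L / 2, L / 2, 2400, endpoint=False)
dx = x[1] - x[0]
SIGMA = 0.5                                 # pulse width
THETA0 = 6.0                                # initial displacement


def pulse(theta):
    """Rendered 1D Gaussian pulse centred at theta."""
    return np.exp(-0.5 * ((x - theta) / SIGMA) ** 2)


def dpulse_dtheta(theta):
    """Analytic derivative of the pulse with respect to theta."""
    return (x - theta) / SIGMA**2 * pulse(theta)


TARGET = pulse(0.0)


def grad_l2(theta):
    """Gradient of sum((render - target)^2) * dx with respect to theta."""
    return float(np.sum(2.0 * (pulse(theta) - TARGET) * dpulse_dtheta(theta)) * dx)


def moments(signal, ks):
    """Complex sinusoidal moments M_k = sum_x signal(x) exp(-2*pi*i*k*x/L) dx."""
    basis = np.exp(-2j * np.pi * np.outer(ks, x) / L)
    return basis @ signal * dx


def grad_spectral(theta, ks):
    """Gradient of sum_k |M_k(render) - M_k(target)|^2 with respect to theta."""
    ks = np.asarray(ks)
    residual = moments(pulse(theta), ks) - moments(TARGET, ks)
    d_moments = moments(dpulse_dtheta(theta), ks)
    return float(2.0 * np.sum(np.real(np.conj(residual) * d_moments)))


def annealed_band(t, k_max=5, steps_per_stage=100):
    """Hypothetical staircase schedule: add one frequency every 100 steps."""
    return np.arange(1, min(k_max, 1 + t // steps_per_stage) + 1)


def descend(grad_fn, steps=500, lr=0.1):
    """Plain gradient descent on theta; grad_fn receives (theta, step index)."""
    theta = THETA0
    for t in range(steps):
        theta -= lr * grad_fn(theta, t)
    return theta


theta_l2 = descend(lambda th, t: grad_l2(th))
theta_static = descend(lambda th, t: grad_spectral(th, [5]))
theta_annealed = descend(lambda th, t: grad_spectral(th, annealed_band(t)))

print("pixel L2      :", theta_l2)
print("static k=5    :", theta_static)
print("annealed 1..5 :", theta_annealed)
```

Under these assumptions the three runs mirror the caption's columns: the pixel-loss run barely moves from its starting point because the gradient is numerically zero without overlap, the static $k=5$ run settles one period ($L/5 = 4.8$) away from the target, and the annealed run reaches the target. The staircase schedule works here only because the first stage is long enough to bring the residual inside the basin of the next frequency; a principled schedule would tie the bandwidth to the remaining error rather than to a fixed step count.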