FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Seonghyun Jin, Jong Chul Ye

Abstract

Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key source of failure is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise, which governs how much the latent state is expected to change between frames, is estimated online from the EMA-normalized temporal drift of candidate tokens. Through extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction compared with existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
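The update described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `filter_step`, the use of mean squared change as the drift measure, the particular EMA normalizer, and the defaults for `r`, `beta`, and `eps` are all hypothetical choices made here for clarity.

```python
from typing import List, Tuple

Vector = List[float]


def filter_step(
    state: List[Vector],      # persistent state tokens s_{t-1}
    var: List[float],         # per-token variance p_{t-1}
    cand: List[Vector],       # candidate tokens at time t
    cand_prev: List[Vector],  # candidate tokens at time t-1
    drift_ema: float,         # running drift scale (assumed normalizer)
    r: float = 1.0,           # scalar measurement noise shared across tokens
    beta: float = 0.9,        # EMA decay (assumed value)
    eps: float = 1e-8,
) -> Tuple[List[Vector], List[float], List[float], float]:
    """One hypothetical token-wise Kalman-style update."""
    # Per-token temporal drift of the candidates: mean squared change.
    drift = [
        sum((a - b) ** 2 for a, b in zip(c, c_prev)) / len(c)
        for c, c_prev in zip(cand, cand_prev)
    ]
    # EMA of the average drift, used to normalize (assumed form).
    drift_ema = beta * drift_ema + (1.0 - beta) * (sum(drift) / len(drift))
    q = [d / (drift_ema + eps) for d in drift]

    new_state, new_var, gains = [], [], []
    for s, p, c, q_i in zip(state, var, cand, q):
        p_pred = p + q_i             # predict: variance grows by process noise
        k = p_pred / (p_pred + r)    # Kalman-style gain in [0, 1)
        # Blend memory and candidate; k -> 1 approaches a plain overwrite.
        new_state.append([s_j + k * (c_j - s_j) for s_j, c_j in zip(s, c)])
        new_var.append((1.0 - k) * p_pred)  # posterior variance
        gains.append(k)
    return new_state, new_var, gains, drift_ema
```

In this sketch, feeding identical candidates on consecutive frames gives zero drift, so `q` is zero and the gain decays step by step; a large jump in the candidates inflates `q` and pushes the gain back toward one.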

Paper Structure (57 sections, 2 theorems, 26 equations, 29 figures, 16 tables, 1 algorithm)

Key Result

Proposition 1

Consider a scalar token in a static regime: the true latent is constant $s^\star$, measurements are $\tilde{s}_t = s^\star + v_t$ with $v_t \sim \mathcal{N}(0,r)$, and process noise is zero ($q=0$). Then the Kalman recursion gives
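For reference, the standard scalar Kalman recursion in this setting can be worked out as below. This is a sketch under the stated assumptions; the initial variance $p_0$ and the indexing convention (the gain $k_t$ is computed from $p_{t-1}$) are assumptions made here and may differ from the paper's notation.

```latex
\begin{align}
k_t &= \frac{p_{t-1}}{p_{t-1} + r}, \\
p_t &= (1 - k_t)\, p_{t-1} = \frac{p_{t-1}\, r}{p_{t-1} + r}
\quad\Longrightarrow\quad
\frac{1}{p_t} = \frac{1}{p_{t-1}} + \frac{1}{r}, \\
\frac{1}{p_t} &= \frac{1}{p_0} + \frac{t}{r}, \\
k_t &= \frac{p_0}{r + t\, p_0} \;\xrightarrow[t \to \infty]{}\; 0 .
\end{align}
```

With $p_0 = r$, for instance, the gains are $1/2, 1/3, 1/4, \dots$, so each new measurement is weighted like one more sample in a running mean.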

Figures (29)

  • Figure 1: FILT3R enables stable long-horizon streaming 3D reconstruction without resets. Top: dense input frames from a long video stream. Middle: final reconstructions from the same frozen streaming backbone under different latent-state update rules. Uniform overwrite (CUT3R) and heuristic gating (TTT3R) accumulate drift and/or forget previously integrated evidence, leading to long-horizon instability and degraded geometry. Bottom: a zoom-in around frames beyond 600 highlights that FILT3R preserves stable geometry over long rollouts. Trajectory colors indicate time.
  • Figure 2: FILT3R overview. The persistent memory $\mathbf{s}_{t-1}$ is fused with the decoder's candidate $\tilde{\mathbf{s}}_t$ via a token-wise Kalman gain $\mathbf{k}_t$. Process noise $\mathbf{q}_t$ is estimated adaptively from EMA-normalized temporal drift, while measurement noise $r$ is a scalar hyperparameter shared across tokens. Variance $\mathbf{p}_t$ is propagated across time steps, enabling gains that naturally shrink in stable regimes and increase during scene change.
  • Figure 3: Qualitative long-horizon streaming 3D reconstruction. CUT3R often suffers from catastrophic forgetting and drift, while TTT3R reduces but does not eliminate long-horizon instability, leading to fragmented surfaces and inconsistent revisited regions (red boxes). FILT3R produces more coherent geometry over long rollouts. Ground-truth trajectory is shown in orange; the predicted trajectory is color-coded by time.
  • Figure 5: Runtime and GPU memory (500 frames). Benchmark on NRGBD (512$\times$384, warmup=2, 10 runs). FILT3R (ours) avoids attention-map caching, matching CUT3R's footprint.
  • Figure 7: How different update rules relate to the gain curve. With fixed $r$, the steady-state Kalman gain is determined by the noise ratio ($q/r$). Overwrite (CUT3R) corresponds to the limit ($k\to1$). A fixed gate chooses one constant operating point. FILT3R adapts $q_t$ and therefore moves along the curve, becoming conservative in stable regimes and reopening at transitions. TTT3R is also adaptive, but its gate is heuristic and based on current cues rather than propagated covariance, so it is shown schematically off the steady-state curve.
  • ...and 24 more figures

Theorems & Definitions (4)

  • Proposition 1: Static scenes yield naturally shrinking gains
  • Proof
  • Proposition 2: Steady-state gain vs. $q/r$
  • Proof
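The dependence of the steady-state gain on $q/r$ named in Proposition 2 can be illustrated numerically. The sketch below assumes a scalar random-walk model with the standard predict/update recursion. The closed form is derived from that model rather than taken from the paper, and the function names `steady_state_gain` and `iterate_gain` are hypothetical.

```python
import math


def steady_state_gain(q: float, r: float) -> float:
    """Fixed point of the scalar Kalman gain for noise ratio rho = q / r.

    Derived from p = (p + q) * r / (p + q + r) together with k = p / r.
    """
    rho = q / r
    return 0.5 * (-rho + math.sqrt(rho * rho + 4.0 * rho))


def iterate_gain(q: float, r: float, p0: float = 1.0, steps: int = 200) -> float:
    """Run the predict/update variance recursion and return the final gain."""
    p = p0
    k = 0.0
    for _ in range(steps):
        p_pred = p + q             # predict: add process noise
        k = p_pred / (p_pred + r)  # gain
        p = (1.0 - k) * p_pred     # posterior variance
    return k


if __name__ == "__main__":
    # Gain rises monotonically with q / r and approaches 1 (overwrite) as q / r grows.
    for rho in (0.01, 0.1, 1.0, 10.0, 100.0):
        print(rho, steady_state_gain(rho, 1.0), iterate_gain(rho, 1.0))
```

Under these assumptions, a fixed gate corresponds to picking one value of $q/r$ on this curve, while an adaptive $q_t$ moves the operating point along it.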