ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare

Freeman Cheng, Botao Ye, Xueting Li, Junqi You, Fangneng Zhan, Ming-Hsuan Yang

TL;DR

ReCoSplat is an autoregressive feed-forward Gaussian Splatting model that supports posed or unposed inputs, with or without camera intrinsics. Its hybrid KV cache compression strategy combines early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames.

Abstract

Online novel view synthesis remains challenging, requiring robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model supporting posed or unposed inputs, with or without camera intrinsics. While assembling local Gaussians using camera poses scales better than canonical-space prediction, it creates a dilemma during training: using ground-truth poses ensures stability but causes a distribution mismatch when predicted poses are used at inference. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at https://freemancheng.com/ReCoSplat .
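The abstract names the two ingredients of the KV cache compression (early-layer truncation and chunk-level selective retention) but not their exact rules. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes early layers keep only the most recent chunks, and later layers keep the chunks with the highest importance score. The names `LayerCache`, `compress_kv_cache`, `early_window`, `keep_per_late_layer`, and the per-chunk `scores` are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LayerCache:
    """Bookkeeping for one layer's KV cache as (chunk_id, num_tokens) entries.

    Real caches hold key/value tensors; only sizes are tracked here.
    """
    chunks: list = field(default_factory=list)


def total_tokens(caches):
    """Total number of cached tokens across all layers."""
    return sum(n for cache in caches for _, n in cache.chunks)


def compress_kv_cache(caches, scores, num_early_layers, early_window,
                      keep_per_late_layer):
    """Compress per-layer caches in place (hypothetical policy).

    - Layers [0, num_early_layers): truncation, keep the last
      `early_window` chunks only (assumption).
    - Remaining layers: selective retention, keep the
      `keep_per_late_layer` chunks with the highest score (assumption),
      preserving their temporal order.
    """
    for layer_idx, cache in enumerate(caches):
        if layer_idx < num_early_layers:
            # Guard against early_window == 0, since chunks[-0:] keeps everything.
            cache.chunks = cache.chunks[-early_window:] if early_window > 0 else []
        else:
            ranked = sorted(cache.chunks, key=lambda c: scores[c[0]], reverse=True)
            keep_ids = {chunk_id for chunk_id, _ in ranked[:keep_per_late_layer]}
            cache.chunks = [c for c in cache.chunks if c[0] in keep_ids]
    return caches
```

For example, with 4 layers, 10 chunks of 8 tokens each, `num_early_layers=2`, `early_window=1`, and `keep_per_late_layer=2`, the cache shrinks from 320 to 48 tokens. These toy numbers only illustrate the bookkeeping and say nothing about the compression ratios reported in the paper.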

Paper Structure (43 sections, 12 equations, 9 figures, 8 tables)

This paper contains 43 sections, 12 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Autoregressive Reconstruction. We present ReCoSplat, a method that reconstructs 3D scenes from sequential image streams.
  • Figure 2: ReCoSplat overview. Given a chunk of input images, a DINO-based encoder extracts features, which are then processed by an alternating-attention transformer with an autoregressive KV cache. The pose head predicts camera poses, while the Gaussian head predicts local Gaussian primitives. To bridge the pose distribution mismatch between training and inference, the Render-and-Compare module renders the current reconstruction at the assembly pose and compares it with the incoming observation to provide conditioning via cross-attention. The predicted local Gaussians are then transformed to world coordinates and merged into the accumulated scene.
  • Figure 3: Novel view synthesis with increasing input views under unposed settings. ReCoSplat improves geometry and reduces artifacts over autoregressive baselines.
  • Figure 4: Novel view synthesis in the fully posed and calibrated setting with 128 and 256 input views. With pose errors removed, reconstruction quality depends on local Gaussian prediction accuracy. Our method can correct local Gaussian mispredictions, outperforming autoregressive baselines and beating YoNoSplat in PSNR.
  • Figure 5: Peak GPU memory usage on an A6000 GPU. KV-cache compression significantly reduces memory usage compared to YoNoSplat and an uncompressed baseline. OOM thresholds for common GPU models are marked.
  • ...and 4 more figures
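
The Figure 2 caption says the predicted local Gaussians are transformed to world coordinates before being merged into the scene. A minimal sketch of that step for a single Gaussian follows. It assumes the pose is given as a camera-to-world rotation `R` and translation `t`, and uses the standard rigid-transform rule for a Gaussian: the mean maps to `R @ mean + t` and the covariance to `R @ cov @ R^T`. The function names are hypothetical, and the paper's actual parameterization (for example scale and quaternion instead of a full covariance) may differ.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def local_to_world(mean, cov, rotation, translation):
    """Map one Gaussian from the camera (local) frame to the world frame.

    rotation, translation: assumed camera-to-world pose (R, t).
    Returns (world_mean, world_cov) with
        world_mean = R @ mean + t
        world_cov  = R @ cov @ R^T
    """
    world_mean = [
        sum(rotation[i][k] * mean[k] for k in range(3)) + translation[i]
        for i in range(3)
    ]
    world_cov = matmul3(matmul3(rotation, cov), transpose3(rotation))
    return world_mean, world_cov
```

With a 90-degree rotation about the z-axis, a Gaussian elongated along the local x-axis ends up elongated along the world y-axis, which is a quick sanity check that the covariance is rotated together with the mean.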