Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting

Hansol Lim, Jongseong Brad Choi

TL;DR

This work introduces CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers, and presents Splat2Real, centered on novel-view scaling.

Abstract

Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique <= 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provide embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
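
The abstract describes CN-Coverage only at a high level, so the following is a minimal sketch of what a greedy coverage+novelty selection could look like, not the authors' implementation. It assumes each candidate view comes with a set of visible surface-element ids and a scalar extrapolation score; the scoring rule (newly covered elements minus `lam` times the extrapolation score), the weight `lam`, and the function names `greedy_select` and `training_views` are all hypothetical. The second function illustrates the budget rule stated in the abstract: at most 500 unique views, with sampling with replacement for larger budgets.

```python
import random


def greedy_select(candidates, budget, lam=0.5):
    """Greedily pick up to `budget` views.

    candidates: list of (view_id, visible_ids, extrapolation) tuples, where
    visible_ids is a set of surface-element ids and extrapolation is a float.
    The score below is an assumed stand-in for "geometry gain minus
    extrapolation penalty"; the paper's exact formula is not given here.
    """
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < budget:
        def score(c):
            _, visible, extrapolation = c
            gain = len(visible - covered)  # newly covered surface elements
            return gain - lam * extrapolation
        best = max(remaining, key=score)
        chosen.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return chosen, covered


def training_views(selected, budget, unique_cap=500, seed=0):
    """Return `budget` training views from an ordered selection.

    Up to `unique_cap` distinct views are used; larger budgets are filled
    by sampling with replacement from that capped set.
    """
    unique = selected[:unique_cap]
    if not unique or budget <= len(unique):
        return unique[:max(budget, 0)]
    rng = random.Random(seed)
    return [rng.choice(unique) for _ in range(budget)]


if __name__ == "__main__":
    cands = [
        ("a", {1, 2, 3}, 0.0),
        ("b", {3, 4}, 0.0),
        ("c", {1, 2, 3, 4, 5, 6}, 10.0),  # wide coverage, heavy extrapolation
    ]
    print(greedy_select(cands, budget=2, lam=0.5))
    print(len(training_views(list(range(500)), budget=2000)))
```

In this toy setup the penalty keeps the far-extrapolated view `c` out of the selection even though it covers the most surface; setting `lam=0.0` reduces the rule to pure coverage and `c` is picked first.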

Paper Structure (55 sections, 9 equations, 11 figures, 29 tables, 1 algorithm)

Figures (11)

  • Figure 1: Splat2Real pipeline. Real captures build a 3DGS observation teacher for high-throughput novel-view RGB rendering; simulator-style mesh rendering provides aligned metric oracle labels. A scaling policy (Random/Robot/Coverage/CN-Coverage) selects viewpoint budget $N$. Gated/composited fallback is used as a secondary safety layer.
  • Figure 2: Step-matched scaling: metric AbsRel vs budget $N$ for Random, Robot, Coverage, CN-Coverage, CN-Coverage (MeshHist), and GOL-Gated CN-Coverage. Error bars denote 95% CI across sequences; connecting lines are guides to the eye. Budgets are discrete and plotted on categorical x-axis positions. The dashed vertical marker at $N{=}500$ denotes the unique-view cap; for $N{>}500$, training samples with replacement from the selected set (resampling-based count scaling).
  • Figure 3: AbsRel versus surface-coverage fraction. Each point is one (sampler, $N$) setting.
  • Figure 4: Error versus pose-novelty bins at low/high $N$ in vertical subpanels: (a) $N=0$, (b) $N=2000$. The x-axis uses per-sequence novelty quantile bins 1--5 (1 = least novel, 5 = most novel), so all 20 sequences contribute to each panel. Shaded regions denote 95% CI across sequences.
  • Figure 5: Teacher-quality interaction scatter at multiple budgets, where $\Delta$AbsRel $=$ AbsRel(GS) $-$ AbsRel(mesh+Hist) and negative is better. Dashed line marks gate threshold $q_s=1.0$. Correlations (Pearson/Spearman) are $N{=}0$: 0.06/$-0.07$, $N{=}500$: 0.21/0.08, $N{=}2000$: $-0.23$/$-0.11$.
  • ...and 6 more figures
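
The captions above rely on three quantities that are easy to state in code: the AbsRel depth error, the difference $\Delta$AbsRel from Figure 5, and the per-sequence novelty quantile bins from Figure 4. The sketch below uses the standard definition of AbsRel (mean of $|d_{pred} - d_{gt}| / d_{gt}$ over pixels with valid ground truth). The rank-based binning and the function names are assumptions made for illustration; the paper may compute its bins differently.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with gt > 0."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_abs_rel(err_gs, err_mesh_hist):
    """AbsRel(GS) - AbsRel(mesh+Hist); negative is better, as in Figure 5."""
    return err_gs - err_mesh_hist


def novelty_bins(novelty, n_bins=5):
    """Assign each frame of ONE sequence a bin in 1..n_bins by novelty rank.

    Binning within a sequence (assumed here to be rank-based) means every
    sequence contributes frames to every bin, whatever its absolute
    novelty range.
    """
    order = sorted(range(len(novelty)), key=lambda i: novelty[i])
    bins = [0] * len(novelty)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(novelty) + 1
    return bins


if __name__ == "__main__":
    print(abs_rel([2.2, 1.0, 3.0], [2.0, 1.0, 0.0]))  # last pixel is invalid
    print(novelty_bins([0.9, 0.1, 0.5, 0.3, 0.7]))
```

Per-sequence binning is what lets all 20 sequences appear in each panel of Figure 4: bin 5 means "most novel for this sequence" rather than a fixed pose distance.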