Sphinx: Efficiently Serving Novel View Synthesis using Regression-Guided Selective Refinement

Yuchen Xia, Souvik Kundu, Mosharaf Chowdhury, Nishil Talati

TL;DR

Sphinx addresses a core challenge in novel view synthesis: delivering diffusion-level fidelity at substantially lower cost. It does so with a training-free hybrid pipeline that initializes diffusion with a regression prediction and then performs selective, region-aware refinement guided by adaptive noise scheduling, temporal latent reuse, and sparse convolution. The framework uses CLIP-derived scene clustering to drive adaptive denoising depth and refinement strategies while preserving temporal and spatial coherence. Empirical results across RE10K, DL3DV, and ACID show average speedups of around 1.8× (up to 2.2×) with less than 5% perceptual degradation, establishing a Pareto frontier that balances fidelity, latency, and energy for dynamic inference scenarios.

Abstract

Novel View Synthesis (NVS) is the task of generating new images of a scene from viewpoints that were not part of the original input. Diffusion-based NVS can generate high-quality, temporally consistent images, but it remains computationally prohibitive. Conversely, regression-based NVS requires significantly less compute but offers suboptimal generation quality, leaving the design of a high-quality, inference-efficient NVS framework an open challenge. To close this gap, we present Sphinx, a training-free hybrid inference framework that achieves diffusion-level fidelity at significantly lower compute. Sphinx uses regression-based fast initialization to guide the diffusion model and reduce its denoising workload. Additionally, it integrates selective refinement with adaptive noise scheduling, allocating more compute to uncertain regions and frames. This lets Sphinx navigate the performance-quality trade-off flexibly, adapting to the latency and fidelity requirements of dynamically changing inference scenarios. Our evaluation shows that Sphinx achieves an average 1.8× speedup over diffusion model inference with negligible perceptual degradation (less than 5%), establishing a new Pareto frontier between quality and latency in NVS serving.
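The idea of starting diffusion from a regression prediction and refining only uncertain regions can be illustrated with a toy sketch. This is not the authors' implementation: the function `hybrid_refine`, the linear noise schedule, the `start_frac` parameter, the `denoise_step` callback signature, and the toy denoiser are all assumptions made for illustration. A real system would call a diffusion model and would skip computation in confident regions (for example with sparse convolution) rather than masking after each step.

```python
import numpy as np


def hybrid_refine(regression_img, uncertain_mask, denoise_step,
                  total_steps=50, start_frac=0.4, seed=0):
    """Toy regression-initialized, masked denoising loop (illustrative only).

    Instead of starting from pure noise and running `total_steps` steps,
    the regression prediction is lightly noised and only the last
    `start_frac` fraction of the schedule is run.
    """
    if not 0.0 < start_frac <= 1.0:
        raise ValueError("start_frac must be in (0, 1]")
    rng = np.random.default_rng(seed)

    # Number of denoising steps actually executed.
    steps_run = max(1, int(round(total_steps * start_frac)))

    # Hypothetical linear noise levels from start_frac down to 0.
    sigmas = np.linspace(start_frac, 0.0, steps_run + 1)

    # Fast initialization: regression output plus noise at the start level.
    x = regression_img + sigmas[0] * rng.standard_normal(regression_img.shape)

    for i in range(steps_run):
        x = denoise_step(x, sigmas[i], sigmas[i + 1])
        # Selective refinement: confident pixels are reset to the regression
        # prediction, so only uncertain pixels are effectively refined.
        x = np.where(uncertain_mask, x, regression_img)
    return x, steps_run


# Toy stand-in for a diffusion model: shrinks the residual to a known
# "clean" image in proportion to the noise level.
clean = np.ones((4, 4))


def toy_denoise_step(x, sigma, sigma_next):
    return clean + (x - clean) * (sigma_next / sigma)


regression_img = np.zeros((4, 4))
uncertain_mask = np.zeros((4, 4), dtype=bool)
uncertain_mask[:2] = True  # pretend the top half is disoccluded

refined, steps_run = hybrid_refine(regression_img, uncertain_mask,
                                   toy_denoise_step)
print(steps_run, refined)
```

With these defaults the loop runs 20 of the 50 nominal steps, the masked rows converge to the toy denoiser's target, and the unmasked rows keep the regression output unchanged. The step counts here are arbitrary and are not meant to reproduce the reported speedups.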

Paper Structure

This paper contains 23 sections, 2 equations, 15 figures, 2 tables, 1 algorithm.

Figures (15)

  • Figure 1: Overview of Sphinx that combines regression and diffusion models for low-latency, high-quality generation.
  • Figure 2: Example NVS task: two input camera frames are used to generate four intermediate novel views between them.
  • Figure 3: Proposed hybrid approach combining regression-based fast initialization with diffusion-based selective denoising.
  • Figure 4: Number of denoising steps for each target frame in our pipeline on the RE10K dataset, with error bars indicating variability across scenes.
  • Figure 5: Qualitative comparison between regression output (MVSplat) and ground truth. The regression outputs contain disoccluded regions (highlighted in red) where artifacts and missing details appear, while the remaining regions are clean and consistent.
  • ...and 10 more figures