Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Fereshteh Forghani, Jason J. Yu, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Marcus A. Brubaker

TL;DR

Learn Your Scales tackles scale ambiguity in generative novel view synthesis (GNVS) trained on uncalibrated multiview data, where the unknown scene scale adds uncertainty to the generated views. The authors propose learning a per-scene scale end-to-end: each scene $i$ is assigned a scale $s_i = \exp(a [\beta_i]_{-1}^{+1})$, which is applied to its camera translations as $\hat{\mathbf{t}}_j = s_i \mathbf{t}_j$ during diffusion-model training. They also introduce two metrics, Sample Flow Consistency (SFC) and Scale-Sensitive Thresholded Symmetric Epipolar Distance (SS-TSED), to quantify scale inconsistency. Experiments on RealEstate10K with PolyOculus show that learning scales reduces scale variability and improves image quality relative to fixed-scale or ad-hoc normalization approaches, with additional gains when metric-depth references are used. The approach requires no preprocessing, which makes it a practical way to train GNVS models on uncalibrated real-world multiview datasets.
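
To make the scale parameterization concrete, here is a minimal NumPy sketch of the forward computation only. It assumes that the bracket $[\beta_i]_{-1}^{+1}$ denotes clipping $\beta_i$ to $[-1, 1]$ and that $a$ is a fixed positive hyperparameter; both are readings of the notation rather than details confirmed by this summary. The function names `scene_scale` and `rescale_translations` are hypothetical. In actual training, $\beta_i$ would be a learnable parameter optimized jointly with the diffusion model in an autodiff framework, which this sketch does not attempt.

```python
import numpy as np

def scene_scale(beta, a):
    """Per-scene scale s = exp(a * clip(beta, -1, 1)).

    Assumption: the bracket notation in the summary means clipping to [-1, 1].
    Under that reading the scale is bounded in [exp(-a), exp(a)].
    """
    return np.exp(a * np.clip(beta, -1.0, 1.0))

def rescale_translations(translations, beta, a):
    """Apply t_hat_j = s_i * t_j to every camera translation of one scene.

    translations: array of shape (num_cameras, 3)
    beta:         scalar scale parameter of the scene (hypothetical name)
    a:            scalar bound on the log-scale (hypothetical value in the demo)
    """
    s = scene_scale(beta, a)
    return s * np.asarray(translations, dtype=float)

if __name__ == "__main__":
    t = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    # beta = 0 leaves the translations unchanged, since exp(0) = 1.
    print(rescale_translations(t, beta=0.0, a=np.log(2.0)))
    # beta far below -1 is clipped, so the scale saturates at exp(-a) = 0.5.
    print(rescale_translations(t, beta=-3.0, a=np.log(2.0)))
```

Under these assumptions, the exponential keeps the scale positive and the clipping keeps it within a fixed multiplicative range around 1.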

Abstract

Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.

Paper Structure

This paper contains 22 sections, 7 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Scale ambiguity and inconsistency in GNVS. (Top) Two novel views are independently sampled using the same conditioning and camera motion, $\Delta x$. Samples exhibit different disparities due to uncertainty over scene scale. The boxes depict plausible top-down scene layouts for these samples. This uncertainty occurs when GNVS models are trained with inconsistently calibrated data. (Bottom) Additional samples and scenes in the same setting, with a salient edge highlighted to show the different disparities in the generated views.
  • Figure 1: Effect of the number of conditioning views on sample scale variance. We take the right edge of the door and manually measure its movement in ten samples generated by conditioning on one and on two frames. We depict the edge movements in both cases as boxplots. The right side shows four samples generated conditioned on one and two views, respectively.
  • Figure 2: SFC metric overview. Given a set of size $n$ of generated frames, we use an optical flow estimator to obtain optical flows between the conditioning and generated frames. The flows are masked and normalized. To illustrate the mask generation process, optical flow vectors of two patches are shown: a masked patch and an unmasked patch. Red flow vectors represent a group of masked pixels moving out of the field-of-view by forward flow. Blue flow vectors correspond to a group of unmasked pixels whose flows are in opposite directions, indicating cycle consistency. After masking, we normalize the masked flows with the average flow magnitude of unmasked pixels, $\bar{f}$. Finally, we compute per-pixel median absolute deviation (MAD) over $n$ masked normalized flows to get the MAD map, the median value of which is defined as SFC.
  • Figure 2: Mean absolute difference of log scales plot. This plot shows the trend of scale changes during training, and its plateau indicates scale convergence.
  • Figure 3: Examples of per-pixel optical flow MAD maps. The darker the pixel, the lower the variation of optical flow in that pixel, indicating a more consistent scale among the generated samples. The entropy in scale can also be seen by comparing the generated frames, e.g., the width of the door in the second row.
  • ...and 9 more figures
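
The SFC computation described in the Figure 2 caption (SFC metric overview) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the optical flows and validity masks are taken as given inputs, the normalizer $\bar{f}$ is assumed to be a single average flow magnitude shared by all $n$ samples (a per-sample normalizer would largely divide out the very scale differences being measured), and the per-pixel MAD of 2D flow vectors is assumed to be the median Euclidean distance to the component-wise median flow. The function name `sample_flow_consistency` is hypothetical.

```python
import numpy as np

def sample_flow_consistency(flows, valid):
    """Sketch of SFC: median of the per-pixel MAD map over n normalized flows.

    flows: (n, H, W, 2) optical flow from the conditioning frame to each sample
    valid: (n, H, W) bool, True where a pixel is unmasked for that sample
    Returns (sfc, mad_map); mad_map is NaN where fewer than 2 samples are valid.
    """
    flows = np.asarray(flows, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    mad_map = np.full(valid.shape[1:], np.nan)

    # Pixels need at least two unmasked samples for a deviation to be defined.
    keep = valid.sum(axis=0) >= 2
    if not keep.any():
        return float("nan"), mad_map

    # Assumed normalizer f_bar: mean flow magnitude over all unmasked pixels.
    mag = np.linalg.norm(flows, axis=-1)
    f_bar = max(mag[valid].mean(), 1e-8)

    # Normalize, then mark masked entries as NaN so they are ignored below.
    normed = np.where(valid[..., None], flows / f_bar, np.nan)

    vals = normed[:, keep]                      # (n, K, 2)
    med = np.nanmedian(vals, axis=0)            # component-wise median, (K, 2)
    dev = np.linalg.norm(vals - med, axis=-1)   # distance to the median, (n, K)
    mad = np.nanmedian(dev, axis=0)             # per-pixel MAD, (K,)

    mad_map[keep] = mad
    return float(np.median(mad)), mad_map

if __name__ == "__main__":
    # Three synthetic samples whose flows differ only by a global factor,
    # mimicking samples generated at different scene scales.
    base = np.ones((4, 5, 2))
    flows = np.stack([1.0 * base, 2.0 * base, 3.0 * base])
    valid = np.ones((3, 4, 5), dtype=bool)
    sfc, _ = sample_flow_consistency(flows, valid)
    print(sfc)
```

In this sketch, identical flows across samples give an SFC of zero, and larger values indicate more variation in apparent motion across samples, which matches the caption's description of a lower MAD meaning a more consistent scale.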