Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis

Saar Stern, Ido Sobol, Or Litany

TL;DR

This work addresses the challenge of reliably evaluating novel view synthesis (NVS) by introducing a task-aware framework that leverages diffusion-model features from the strong NVS backbone Zero123-XL. The authors construct PRISM, which comprises a reference-based distance, $D_{\text{PRISM}}$, and a reference-free distributional measure, $\text{MMD}_{\text{PRISM}}$. Both rely on features refined by contrastive fine-tuning on a purpose-built ViewMatch dataset to discriminate plausible from implausible view syntheses. Across multiple benchmarks (Toys4K, Google Scanned Objects, OmniObject3D) and human studies, $D_{\text{PRISM}}$ shows strong alignment with human preferences, while $\text{MMD}_{\text{PRISM}}$ yields stable, interpretable model rankings without ground-truth targets. The framework is robust to pose misalignment and image degradations, offering a principled, practical pathway toward more reliable progress in single-view NVS evaluation. Its main limitation is dependence on the Zero123-XL backbone; extending it to scene-level scenarios remains a potential direction.

Abstract

The goal of Novel View Synthesis (NVS) is to generate realistic images of a given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating its reliability remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results as they fail to capture the nuanced relationship between the source image, viewpoint change, and generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}_{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where $\text{MMD}_{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.

Paper Structure

This paper contains 49 sections, 14 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Standard metrics (PSNR, SSIM, LPIPS, CLIP-S) often mis-rank incorrect generations in novel view synthesis. Our metric, $D_{\text{PRISM}}$, penalizes these incorrect outputs, aligning more closely with human judgments. Each pair shows outputs from different NVS models under the same input, with the output favored by each metric indicated.
  • Figure 2: Method Overview. (Left) Feature extraction: given source, target, and camera transformation, we noise the target image and extract features from a diffusion-based NVS model. These are pooled and tuned into $f_{\text{PRISM}}$. (Right) Evaluation framework: Full-Reference: measure distance between $f_{\text{PRISM}}$ of a predicted triplet and its ground-truth counterpart. No-Reference: compute MMD between $f_{\text{PRISM}}$ from generated triplets and an anchor set of real triplets.
  • Figure 3: Overview of our ViewMatch creation process of positive and negative target examples. (Top Left) Given a 3D mesh and source and target viewpoints, we extract visibility and invisibility masks of the target view from the source, based on the visible faces of the target from the source. (Bottom Left) Given a 3D mesh and source and target viewpoints, we extract an epipolar invisibility mask, representing the unseen regions from the target view, beyond the object. (Right) We augment the visibility and invisibility masks with parts of the epipolar masks, to enable shape changes, and pass the true target and the created masks to an inpainting model.
  • Figure 4: Examples from ViewMatch. Each group shows a source view, its ground-truth target, and three generated samples. Positives preserve consistency; negatives violate it through targeted inpainting as described in the dataset section.
  • Figure 5: Degradation of $D_{\text{PRISM}}$ (denoted as $D$ in the plot) under Gaussian blur at increasing intensity levels.
  • ...and 9 more figures
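
The two evaluation modes in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the $f_{\text{PRISM}}$ features have already been extracted as fixed-length vectors. The function names, the Euclidean distance for the full-reference score, the RBF kernel with a fixed bandwidth, and the biased MMD estimator are all illustrative assumptions; this summary does not specify which the authors use.

```python
import numpy as np


def feature_distance(pred_feat, gt_feat):
    """Full-reference score: distance between two feature vectors.

    Assumption: Euclidean distance; the paper's exact distance is not given here.
    """
    return float(np.linalg.norm(pred_feat - gt_feat))


def rbf_kernel(x, y, bandwidth):
    """RBF kernel matrix between the rows of x (n, d) and the rows of y (m, d)."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def mmd_squared(generated, anchor, bandwidth=2.0):
    """No-reference score: biased estimate of squared MMD between two feature sets.

    Assumption: RBF kernel with a hand-picked bandwidth.
    """
    k_gg = rbf_kernel(generated, generated, bandwidth)
    k_aa = rbf_kernel(anchor, anchor, bandwidth)
    k_ga = rbf_kernel(generated, anchor, bandwidth)
    return float(k_gg.mean() + k_aa.mean() - 2.0 * k_ga.mean())


# Toy usage with random stand-ins for features (hypothetical data, not real features).
rng = np.random.default_rng(0)
anchor_feats = rng.normal(size=(64, 8))          # anchor set of "real" triplets
close_feats = rng.normal(size=(64, 8))           # generator matching the anchor distribution
shifted_feats = rng.normal(size=(64, 8)) + 3.0   # generator with a shifted distribution

print("MMD^2 (close):  ", mmd_squared(close_feats, anchor_feats))
print("MMD^2 (shifted):", mmd_squared(shifted_feats, anchor_feats))
print("distance (one pair):", feature_distance(close_feats[0], anchor_feats[0]))
```

In this toy setup the shifted set receives the larger MMD value, mirroring the convention stated in the abstract that lower scores indicate stronger models.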