Appreciate the View: A Task-Aware Evaluation Framework for Novel View Synthesis
Saar Stern, Ido Sobol, Or Litany
TL;DR
This work addresses the challenge of reliably evaluating novel view synthesis (NVS) by introducing a task-aware framework that leverages diffusion-model features from the strong NVS backbone Zero123-XL. The authors construct PRISM, which consists of a reference-based distance $D_{\text{PRISM}}$ and a reference-free distributional measure $\text{MMD}_{\text{PRISM}}$. Both are optimized through contrastive fine-tuning on a purpose-built ViewMatch dataset to discriminate plausible from implausible view syntheses. Across multiple benchmarks (Toys4K, Google Scanned Objects, OmniObject3D) and human studies, $D_{\text{PRISM}}$ shows strong alignment with human preferences, while $\text{MMD}_{\text{PRISM}}$ yields stable, interpretable model rankings without ground-truth targets. The framework demonstrates robustness to pose misalignment and image degradations, offering a principled, practical pathway toward more reliable progress in single-view NVS evaluation. Its main limitation is its dependence on the Zero123-XL backbone; extending it to scene-level scenarios is a potential direction for future work.
Abstract
The goal of Novel View Synthesis (NVS) is to generate realistic images of given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating the reliability of such generations remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and the intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results because they fail to capture the nuanced relationship between the source image, the viewpoint change, and the generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}_{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D. On all three, $\text{MMD}_{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.
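As a rough illustration of why a lower MMD-style score can indicate a stronger model, the sketch below computes the standard biased estimator of the squared Maximum Mean Discrepancy with a Gaussian kernel. It is not the authors' implementation: the abstract does not specify the kernel, the estimator, or which two feature distributions $\text{MMD}_{\text{PRISM}}$ compares, so the kernel choice, the bandwidth `sigma`, and the synthetic vectors standing in for Zero123 features are all assumptions made for illustration.

```python
import math
import random


def rbf(a, b, sigma):
    # Gaussian (RBF) kernel between two feature vectors (assumed kernel choice).
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    # Average kernel value over all pairs (x, y).
    total = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, sigma=1.0):
    # Biased estimator of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (
        mean_kernel(xs, xs, sigma)
        + mean_kernel(ys, ys, sigma)
        - 2.0 * mean_kernel(xs, ys, sigma)
    )


def sample(rng, n, dim, shift):
    # Hypothetical stand-in for extracted features: Gaussian vectors with a mean shift.
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
reference = sample(rng, 100, 8, 0.0)    # stand-in for features of plausible views
plausible = sample(rng, 100, 8, 0.0)    # same distribution as the reference set
implausible = sample(rng, 100, 8, 1.0)  # shifted distribution (simulated failures)

sigma = 3.0  # assumed bandwidth
score_plausible = mmd2_biased(reference, plausible, sigma)
score_implausible = mmd2_biased(reference, implausible, sigma)
print(score_plausible < score_implausible)
```

In this toy setting, the set drawn from the same distribution as the reference scores lower than the shifted set, which mirrors the "lower is better" reading of the metric. The discriminative power of the real metric would come from the fine-tuned features, which this sketch does not model.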
