A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D

Xiaohan Fei; Chethan Parameshwara; Jiawei Mo; Xiaolong Li; Ashwin Swaminathan; CJ Taylor; Paolo Favaro; Stefano Soatto

TL;DR

This work targets the lack of quantitative metrics for SDS-based text-to-3D by introducing an objective evaluation protocol that measures the Janus problem, text-3D alignment, and realism, validated against human judgments. It formalizes the SDS objective and analyzes its core challenges, including viewpoint conditioning and nuisance variability, and demonstrates how a multiview diffusion framework can mitigate some failures. The authors propose a two-stage baseline combining Multiview Diffusion and Gaussian Splatting, with a refinement stage that fuses SDS signals from MVDream and Stable Diffusion to improve fidelity, while carefully managing the Janus trade-off. Empirically, the protocol reveals strengths and limitations of current methods, and the full approach achieves competitive alignment and realism with favorable efficiency, establishing a strong, reusable baseline for future text-to-3D research.

Abstract

The development of generative models that create 3D content from a text prompt has made considerable strides thanks to the use of the score distillation sampling (SDS) method on pre-trained diffusion models for image generation. However, the SDS method is also the source of several artifacts, such as the Janus problem, misalignment between the text prompt and the generated 3D model, and 3D model inaccuracies. While existing methods rely heavily on qualitative assessment of these artifacts through visual inspection of a limited set of samples, in this work we propose more objective quantitative evaluation metrics, which we cross-validate via human ratings, and present an analysis of the failure cases of the SDS technique. We demonstrate the effectiveness of this analysis by designing a novel, computationally efficient baseline model that achieves state-of-the-art performance on the proposed metrics while addressing all the above-mentioned artifacts.
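
For orientation, the SDS update referred to in the abstract is commonly written in the form introduced by DreamFusion. The block below is a sketch in that standard notation, which is an assumption here and may differ from the symbols used in the paper's own formulation: $\theta$ denotes the 3D model parameters, $g$ a differentiable renderer, $y$ the text prompt, $\hat{\epsilon}_\phi$ the pre-trained diffusion model's noise prediction, and $w(t)$ a timestep-dependent weight.

```latex
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi,\, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I).
\]
```

In this form the rendered image $x$ is noised to $x_t$, and the difference between the predicted and the injected noise is back-propagated through the renderer only, not through the diffusion model.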

Paper Structure (21 sections, 4 equations, 10 figures, 4 tables)

Introduction
Problem formulation and analysis
Score distillation sampling
Analysis of score distillation sampling
Method
Multiview diffusion
Gaussian Splatting
Proposed baseline method
Proposed evaluation protocol
Considered state-of-the-art methods
Used text prompts
Quality metrics
Training efficiency metric
Related work
Conclusion
...and 6 more sections

Figures (10)

Figure 1: Demonstration of the Janus problem in DreamFusion [poole2023dreamfusion], a state-of-the-art text-to-3D model. Prompt: "A corgi." We show renderings of the generated 3D corgi from different viewpoints (left to right): front, left, back, and right views.
Figure 2: Ablation of the regularization term. Prompt: "An astronaut riding a horse." Left: our method without any regularization. Right: our method with the regularization term (Eq. \ref{eq-sparsity}), which effectively reduces floaters.
Figure 3: Ablation of the refinement stage. Two examples are shown. In each example, the image on the left shows the rendering of our first stage model, and the one on the right shows the rendering of our full model. The refinement stage greatly improves the alignment between the 3D content and the text prompt as well as the realism of the 3D content.
Figure 4: Correlation between CLIP R-Precision and human annotation. The abscissa shows the percentage of generated 3D content that human reviewers annotated as aligning well with the given text prompts. The ordinate shows CLIP R-Precision. The plot shows that CLIP R-Precision correlates well with the human annotation, which supports our choice of CLIP R-Precision as the algorithmic method for evaluating text-to-3D alignment.
Figure 5: Qualitative comparison. Each row shows 3D models generated by one method using different prompts. Prompts used (left to right): "a DSLR photo of A very beautiful tiny human heart organic sculpture made of copper wire and threaded pipes, very intricate, curved, Studio lighting, high resolution", "a zoomed out DSLR photo of a colorful camping tent in a patch of grass", "a DSLR photo of the Imperial State Crown of England", "a bald eagle carved out of wood", "a tiger karate master", "an astronaut riding a kangaroo". Due to the space limit, we do not include results of DreamFusion+PerpNeg (similar to DreamFusion) and DreamGaussian (very poor compared to its updated version DreamGaussian+MVDream). More visual results can be found in the supplementary material.
...and 5 more figures
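
The CLIP R-Precision metric referenced in the Figure 4 caption is, in its usual top-1 form, a retrieval accuracy: a rendering counts as a hit when its own prompt scores highest among all candidate prompts. The sketch below illustrates only that final step. It assumes a precomputed square similarity matrix in which row i holds the similarities between the rendering generated for prompt i and every candidate prompt; the function name `r_precision` and the toy numbers are hypothetical, and the paper's exact protocol (number of views, CLIP variant, distractor set) is not reproduced here.

```python
from typing import Sequence


def r_precision(similarity: Sequence[Sequence[float]]) -> float:
    """Top-1 retrieval accuracy over a square similarity matrix.

    similarity[i][j] is assumed to be the image-text similarity between the
    rendering generated for prompt i and candidate prompt j, so the correct
    match for row i is column i.
    """
    n = len(similarity)
    if n == 0 or any(len(row) != n for row in similarity):
        raise ValueError("expected a non-empty square matrix")
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the highest-scoring candidate prompt for rendering i.
        best = max(range(n), key=lambda j: row[j])
        hits += int(best == i)
    return hits / n


# Toy similarities (made up for illustration): rows 0 and 2 retrieve their
# own prompt, while row 1 is confused with prompt 2.
toy = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
]
score = r_precision(toy)
```

In practice the similarity entries would come from a CLIP image and text encoder applied to rendered views and prompts; keeping the scoring step separate from the encoder makes it easy to compare the same score against human annotations.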
