MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences

Weitao Wang, Haoran Xu, Yuxiao Yang, Zhifang Liu, Jun Meng, Haoqian Wang

TL;DR

This work tackles the misalignment between automatic metrics and human preferences in image-to-3D evaluation by constructing a standardized prompt-and-annotation pipeline and introducing MVReward, a BLIP-based multi-view encoder reward model trained on 16k expert pairwise comparisons. It also proposes MVP, a plug-and-play tuning strategy that uses MVReward to align multi-view diffusion models with human preferences, improving geometry and texture quality across methods. Empirical results show MVReward outperforms traditional metrics in predicting human judgments, and MVP consistently enhances baseline multi-view diffusion models like Wonder3D and Era3D. The framework enables fair, transparent evaluation and more aligned generation in image-driven 3D synthesis, with potential for broader adoption in 3D content creation pipelines.

Abstract

Recent years have witnessed remarkable progress in 3D content generation. However, corresponding evaluation methods struggle to keep pace. Automatic approaches have proven challenging to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework to better align and evaluate multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL$\cdot$E and Objaverse, which we then use to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset with 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other more fairly and transparently. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward can serve as a reliable metric and that MVP consistently enhances the alignment of multi-view diffusion models with human preferences.
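
To make the idea of training a reward model on pairwise comparisons concrete, here is a minimal sketch of a Bradley-Terry style preference loss. The abstract does not state which loss MVReward uses, so this objective is an assumption chosen because it is a common one for pairwise reward models. The function names `pairwise_preference_loss` and `batch_loss` are hypothetical, and the scores stand in for whatever scalar the reward model assigns to a set of multi-view images.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the human-preferred sample scores higher than
    the rejected one, and grows as the ordering is violated.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called on a large positive number.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(score_pairs):
    """Average the loss over (preferred_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(w, l) for w, l in score_pairs) / len(score_pairs)
```

With equal scores the loss is log 2, and it falls toward zero as the preferred sample's score pulls ahead of the rejected one. That is the behavior one would want from a scorer meant to reproduce annotator rankings.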

Paper Structure

This paper contains 31 sections, 2 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Automatic metrics often struggle to align with human preferences in evaluating image-to-3D tasks. Our MVReward model fills this gap and our MVP further enhances the alignment of existing multi-view diffusion models with human preferences.
  • Figure 2: An overview of our whole framework. Our human annotation dataset is constructed by a text prompt $\Rightarrow$ image prompt $\Rightarrow$ multi-view images $\Rightarrow$ human annotation procedure (Sec. 3). Then we train our MVReward model, which includes a multi-view encoder and a scorer to effectively encode human preferences and evaluate multi-view images (Sec. 4). Finally, we propose MVP to fine-tune multi-view diffusion models by combining pre-trained loss with our reward loss (Sec. 5).
  • Figure 3: Example of our text prompt enhancements to prevent potential semantic loss brought by the background removal of image-to-3D methods.
  • Figure 4: Examples from the image prompt distribution. Images closer to the edges represent objects with more complex and creative geometry or texture, while those near the center are mainly simple and common.
  • Figure 5: MVReward architecture with a multi-view encoder and a scorer to encode and predict human preferences.
  • ...and 3 more figures
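
The Figure 2 caption describes MVP as combining the pre-trained loss with a reward loss. A minimal sketch of one way such a combination could look is given below. The weighted-sum form, the use of the negated reward score as the reward loss, and the names `mvp_total_loss` and `reward_weight` are all assumptions for illustration; the page does not give the paper's actual formulation.

```python
def mvp_total_loss(pretrain_loss: float, reward_score: float, reward_weight: float = 0.1) -> float:
    """Hypothetical combined objective for reward-guided fine-tuning.

    pretrain_loss: the diffusion model's original training loss.
    reward_score:  scalar score from a reward model (higher is better).
    reward_weight: assumed trade-off coefficient, not taken from the source.
    """
    # Assumption: maximizing the reward is expressed as minimizing its negation.
    reward_loss = -reward_score
    return pretrain_loss + reward_weight * reward_loss
```

Keeping the pre-trained term in the objective is a common way to stop a fine-tuned model from drifting too far from its original behavior while the reward term pushes outputs toward higher preference scores.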