Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Ziyi Zhang, Li Shen, Deheng Ye, Yong Luo, Huangxuan Zhao, Lefei Zhang

TL;DR

This work tackles the problem of generating coherent multiview images from a single text prompt using few-step diffusion. It introduces MVC-ZigAL, an RL finetuning framework that formulates T2MV denoising as a multiview MDP, uses ZMV-Sampling to reinforce conditioning at test time, and employs MV-ZigAL to transfer those gains into the base policy. To balance per-view fidelity and cross-view consistency, it further adopts constrained optimization with a Lagrangian formulation, yielding MVC-ZigAL, which couples view-level and joint-view feedback. Empirical results on DDPO/MATE-3D show that MVC-ZigAL achieves superior fidelity and cross-view consistency, maintaining few-step efficiency while surpassing state-of-the-art baselines in text-to-multiview generation. The approach offers a practical, scalable path for high-quality multiview synthesis from text prompts in real-time or near real-time settings.

Abstract

Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity but naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.
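The constrained formulation described above (maximize per-view fidelity subject to a joint-view constraint) can be illustrated with a generic Lagrangian relaxation and a dual ascent step on the multiplier. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the function names, the scalar rewards, the threshold, and the learning rate are all hypothetical and chosen for illustration.

```python
# Hypothetical sketch of Lagrangian-relaxed reward shaping.
# None of these names come from the paper; they are illustrative only.

def lagrangian_reward(single_view_rewards, joint_reward, lam, threshold):
    """Mean per-view fidelity plus a multiplier-weighted constraint slack."""
    fidelity = sum(single_view_rewards) / len(single_view_rewards)
    # Assumed constraint: joint_reward >= threshold.
    # The slack is negative when the constraint is violated.
    slack = joint_reward - threshold
    return fidelity + lam * slack


def update_multiplier(lam, joint_reward, threshold, lr=0.1):
    """Dual ascent step: raise lam on violation, lower it otherwise, keep lam >= 0."""
    return max(0.0, lam - lr * (joint_reward - threshold))


if __name__ == "__main__":
    lam = 1.0
    # Joint-view reward below the threshold, so the multiplier grows.
    lam = update_multiplier(lam, joint_reward=0.5, threshold=0.8)
    print(lam)
    print(lagrangian_reward([0.2, 0.4], joint_reward=0.5, lam=lam, threshold=0.8))
```

In this sketch, a larger multiplier makes consistency violations more costly relative to per-view fidelity, which is the usual way a Lagrangian formulation trades off an objective against a constraint.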

Paper Structure

This paper contains 54 sections, 33 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: Text-to-multiview results from MV-Adapter (SDXL), MV-Adapter (LCM-SDXL), and our MVC-ZigAL. MVC-ZigAL delivers consistent and high-fidelity views even in a few-step setting.
  • Figure 2: Comparison of full-step and first-step zigzag schedules for ZMV-Sampling on the non-finetuned MV-Adapter (LCM-SDXL) baseline. Restricting the zigzag pass to the first sampling step yields fine-grained image details, whereas the full-step schedule tends to over-smooth textures.
  • Figure 3: Joint-view optimization emphasizes view consistency but under-optimizes image fidelity; single-view optimization targets image fidelity but compromises view consistency, causing the "multi-face" problem. In contrast, constrained optimization explicitly balances consistency and fidelity.
  • Figure 4: Left: Reward curves tracking the HyperScore gap between standard sampling and ZMV-Sampling during MVC-ZigAL training. Middle & Right: Trade-off curves between single-view fidelity (PickScore, HPSv2) and cross-view consistency (HyperScore) across different methods.
  • Figure 5: Comparison of reward curves (left & middle) and Lagrange multiplier dynamics (right) during training of MVC-ZigAL variants with either adaptive or fixed constraint thresholds.
  • ...and 6 more figures