Refining Few-Step Text-to-Multiview Diffusion via Reinforcement Learning

Ziyi Zhang; Li Shen; Deheng Ye; Yong Luo; Huangxuan Zhao; Lefei Zhang

TL;DR

This work addresses few-step text-to-multiview (T2MV) diffusion, where generating coherent multiview images from a single text prompt quickly tends to cost image fidelity and view consistency. It introduces MVC-ZigAL, a reinforcement learning (RL) finetuning framework that formulates T2MV denoising across all views as a single multiview Markov decision process (MDP), uses ZMV-Sampling to reinforce viewpoint and text conditioning at test time, and employs MV-ZigAL to transfer the resulting gains into the base sampling policy. To balance per-view fidelity and cross-view consistency, it casts finetuning as a constrained optimization problem with a Lagrangian formulation that couples view-level and joint-view feedback. Empirical results on DDPO/MATE-3D show that MVC-ZigAL improves both fidelity and cross-view consistency over state-of-the-art text-to-multiview baselines while preserving few-step efficiency. The approach offers a practical path toward high-quality multiview synthesis from text prompts at few-step inference cost.
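As a rough illustration of what a constrained formulation of this kind can look like, the following is a generic Lagrangian relaxation, not the paper's exact objective. All symbols are assumptions introduced here: $\theta$ for the policy parameters, $c$ for the text prompt, $x^{1:N}$ for the $N$ generated views, $r_{\text{view}}$ for a per-view fidelity reward, $r_{\text{joint}}$ for a joint-view reward, and $\delta$ for a constraint threshold.

```latex
% Constrained problem: maximize per-view fidelity subject to a joint-view constraint
\max_{\theta}\; \mathbb{E}_{c,\, x^{1:N} \sim \pi_\theta}
  \Big[ \tfrac{1}{N} \sum_{i=1}^{N} r_{\text{view}}(x^{i}, c) \Big]
\quad \text{s.t.} \quad
\mathbb{E}_{c,\, x^{1:N} \sim \pi_\theta}
  \big[ r_{\text{joint}}(x^{1:N}, c) \big] \ge \delta

% Lagrangian relaxation with multiplier \lambda \ge 0
\min_{\lambda \ge 0}\; \max_{\theta}\;
\mathbb{E}\Big[ \tfrac{1}{N} \sum_{i=1}^{N} r_{\text{view}}(x^{i}, c) \Big]
+ \lambda \Big( \mathbb{E}\big[ r_{\text{joint}}(x^{1:N}, c) \big] - \delta \Big)
```

Under this reading, $\lambda$ grows while the joint-view constraint is violated and shrinks toward zero once it is satisfied, so the weight on cross-view consistency adapts during training rather than being fixed by hand.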

Abstract

Text-to-multiview (T2MV) generation, which produces coherent multiview images from a single text prompt, remains computationally intensive, while accelerated T2MV methods using few-step diffusion models often sacrifice image fidelity and view consistency. To address this, we propose a novel reinforcement learning (RL) finetuning framework tailored for few-step T2MV diffusion models to jointly optimize per-view fidelity and cross-view consistency. Specifically, we first reformulate T2MV denoising across all views as a single unified Markov decision process, enabling multiview-aware policy optimization driven by a joint-view reward objective. Next, we introduce ZMV-Sampling, a test-time T2MV sampling technique that adds an inversion-denoising pass to reinforce both viewpoint and text conditioning, resulting in improved T2MV generation at the cost of additional inference time. To internalize its performance gains into the base sampling policy, we develop MV-ZigAL, a novel policy optimization strategy that uses the reward advantages of ZMV-Sampling over standard sampling as learning signals for policy updates. Finally, noting that the joint-view reward objective under-optimizes per-view fidelity, whereas naively optimizing single-view metrics neglects cross-view alignment, we reframe RL finetuning for T2MV diffusion models as a constrained optimization problem that maximizes per-view fidelity subject to an explicit joint-view constraint, thereby enabling more efficient and balanced policy updates. By integrating this constrained optimization paradigm with MV-ZigAL, we establish our complete RL finetuning framework, referred to as MVC-ZigAL, which effectively refines the few-step T2MV diffusion baseline in both fidelity and consistency while preserving its few-step efficiency.
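To make the two learning signals described above more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the signals are combined, so the `SamplePair` container, the `combined_advantage` weighting, and the `update_multiplier` dual-ascent rule are all hypothetical names and choices made for illustration. The sketch assumes each prompt is sampled twice (once with standard sampling, once with ZMV-Sampling) and that scalar rewards are already available.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SamplePair:
    """Rewards for one prompt sampled two ways (hypothetical container)."""
    view_rewards_std: List[float]  # per-view fidelity, standard sampling
    view_rewards_zmv: List[float]  # per-view fidelity, ZMV-Sampling
    joint_reward_std: float        # joint-view reward, standard sampling
    joint_reward_zmv: float        # joint-view reward, ZMV-Sampling


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def combined_advantage(pair: SamplePair, lam: float) -> float:
    """Reward advantage of ZMV-Sampling over standard sampling.

    Assumption: the per-view and joint-view advantages are mixed with a
    Lagrange multiplier `lam`; the paper may combine them differently.
    """
    fidelity_adv = mean(pair.view_rewards_zmv) - mean(pair.view_rewards_std)
    joint_adv = pair.joint_reward_zmv - pair.joint_reward_std
    return fidelity_adv + lam * joint_adv


def update_multiplier(lam: float, joint_reward: float,
                      threshold: float, step_size: float = 0.1) -> float:
    """Projected dual-ascent step (assumed rule): increase `lam` while the
    joint-view reward is below `threshold`, and never let it go negative."""
    return max(0.0, lam + step_size * (threshold - joint_reward))


if __name__ == "__main__":
    pair = SamplePair([0.5, 0.5], [0.75, 0.75], 0.25, 0.5)
    lam = update_multiplier(1.0, joint_reward=0.5, threshold=1.0, step_size=0.5)
    print("multiplier:", lam)
    print("advantage:", combined_advantage(pair, lam))
```

In a real training loop the advantage would scale a policy-gradient term on the base sampling policy; that part is omitted here because it depends on details the abstract does not give.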
