Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models

Zhihao Zhang, Xuejun Yang, Weihua Liu, Mouquan Shen

TL;DR

The paper addresses the sensitivity of diffusion-based single-view novel view synthesis (NVS) to its initial noise by learning to produce high-quality noise. It introduces an inference-then-inversion pipeline that injects semantic information from the reference image into the initial noise, and trains a lightweight encoder-decoder network (EDN) to map random noise to high-quality noise. The EDN plugs into pretrained NVS models without architectural changes. After a noise collection and filtering stage driven by the diffusion model itself, the method improves multi-view consistency and detail for SV3D and MV-Adapter on multiple datasets, with negligible inference overhead. Because the diffusion model itself is not fine-tuned, the approach offers a practical, scalable enhancement for diffusion-based 3D view synthesis.

Abstract

Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: https://github.com/zhihao0512/EDN.
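
To make the first contribution more concrete, here is a minimal sketch of a denoise-then-invert round trip built from a generic Euler step, loosely following the Stage I description in Figure 2. It is an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the pretrained NVS model, and the sigma schedule, the step count `n`, and all function names are invented for this example. The paper's exact discretization, conditioning, and filtering are not reproduced.

```python
import numpy as np

def toy_denoiser(x, sigma, ref, data_std=1.0):
    # Hypothetical stand-in for a pretrained denoiser: the exact posterior
    # mean when clean latents are distributed as N(ref, data_std^2 * I).
    w = data_std**2 / (data_std**2 + sigma**2)
    return ref + w * (x - ref)

def euler_step(x, sigma_from, sigma_to, ref):
    # One explicit Euler step of the probability-flow ODE in sigma space.
    # The derivative is evaluated at the starting point (x, sigma_from).
    d = (x - toy_denoiser(x, sigma_from, ref)) / sigma_from
    return x + (sigma_to - sigma_from) * d

def denoise_then_invert(z_T, sigmas, n, ref):
    # Denoise z_T for n steps (sigma decreasing), then walk the same
    # n steps back up (sigma increasing) to obtain an inverted noise.
    x = z_T
    for i in range(n):
        x = euler_step(x, sigmas[i], sigmas[i + 1], ref)
    z_mid = x
    for i in range(n, 0, -1):
        x = euler_step(x, sigmas[i], sigmas[i - 1], ref)
    return z_mid, x

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.1, 10)       # assumed schedule, high to low noise
ref = rng.standard_normal((4, 8, 8))       # stand-in for the image latent
z_T = sigmas[0] * rng.standard_normal((4, 8, 8))  # initial noise at sigma_max
z_mid, z_inv = denoise_then_invert(z_T, sigmas, n=4, ref=ref)
```

In this toy setting the round trip is not the identity. Each Euler step evaluates the derivative at its starting point, so the inverted noise `z_inv` ends up somewhat closer to `ref` than `z_T` was. This loosely mirrors the idea of inverted noise carrying information about the reference image. The pair `(z_T, z_inv)` plays the role of one (random noise, high-quality noise) training sample.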

Paper Structure

This paper contains 23 sections, 13 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Results of two NVS models generated from random Gaussian noise and our EDN-optimized noise, respectively. Images are generated using the same random seed and camera poses. Images synthesized with EDN exhibit better consistency with the ground truth in both appearance contours and local details.
  • Figure 2: The workflow of our high-quality noise learning framework with three stages. Stage I: We first denoise the initial Gaussian noise $\mathbf{z}_{T}$ to obtain $\mathbf{z}_{T-n}$. Then, using the discretized Euler inversion method, we derive the inverted noise $\tilde{\mathbf{z}}_{T}$, which is infused with semantic information from the reference image. The resulting samples are further filtered to ensure that the constructed training dataset is both diverse and representative. Stage II: The initial noise $\mathbf{z}_{T}$ and the VAE embedding $\mathbf{I}$ of the reference image are concatenated and fed into the EDN. The EDN decoder then predicts a semantic information map, which is used to compute the loss based on its differences from both $\mathbf{z}_{T}$ and the inverted noise $\tilde{\mathbf{z}}_{T}$. Stage III: During inference, the EDN injects the predicted image semantic information into the initial random noise before it enters the diffusion reverse process. This produces high-quality noise that enhances the generation performance of the pretrained NVS model.
  • Figure 3: Visual results of different novel view synthesis models on dynamic orbits.
  • Figure 4: EDN with pose prompts. (a) EDN with sine-based pose embedding. The camera's azimuth and elevation angles are encoded using sine embedding and then transformed through AdaGroupNorm into a tensor matching the shape of the Gaussian noise. This tensor is concatenated with the VAE embedding of the reference image and the Gaussian noise before being fed into the EDN. (b) EDN with ray map embedding. The camera pose is encoded using Plücker embedding and converted into a ray map aligned with the Gaussian noise. The ray map is concatenated with the VAE embedding and the Gaussian noise, and the combined input is passed into the EDN.
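
The sine-based variant in Figure 4(a) can be sketched roughly as follows. This is a hedged illustration only. The embedding dimension, the frequency schedule, and the channel order of the concatenation are assumptions, since the caption does not specify them. The AdaGroupNorm transform is replaced here by a plain linear projection (`proj`, hypothetical) that is broadcast over the spatial grid, purely to show how a pose vector can be brought to the shape of the noise.

```python
import numpy as np

def sine_embedding(angle_rad, dim=8, max_period=10000.0):
    # Standard sinusoidal embedding of one scalar angle (assumed form).
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = angle_rad * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def pose_embedding(azimuth_rad, elevation_rad, dim=8):
    # Embed azimuth and elevation separately, then join them.
    return np.concatenate([
        sine_embedding(azimuth_rad, dim),
        sine_embedding(elevation_rad, dim),
    ])

def build_edn_input(noise, image_latent, pose_vec, proj):
    # proj is a hypothetical (C, P) linear map standing in for AdaGroupNorm.
    # It turns the pose vector into C channel values, which are broadcast
    # over the H x W grid so the pose tensor matches the noise shape.
    c, h, w = noise.shape
    pose_map = np.broadcast_to((proj @ pose_vec)[:, None, None], (c, h, w))
    # Channel-wise concatenation; the order is an assumption.
    return np.concatenate([pose_map, image_latent, noise], axis=0)

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8, 8))         # Gaussian noise
image_latent = rng.standard_normal((4, 8, 8))  # stand-in for the VAE embedding
pose_vec = pose_embedding(np.deg2rad(30.0), np.deg2rad(10.0))
proj = rng.standard_normal((4, pose_vec.shape[0]))
edn_input = build_edn_input(noise, image_latent, pose_vec, proj)
```

The ray map variant in Figure 4(b) would replace `pose_map` with a per-pixel Plücker ray encoding, which is already spatial and needs no broadcasting.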