UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

Haowang Cui, Rui Chen, Jiaze Wang, Tao Guo, Zheng Qin

TL;DR

UniView is a novel view synthesis model that uses reference images of a similar object as a strong prior during synthesis. It significantly improves single-image novel view synthesis and outperforms state-of-the-art methods on challenging datasets.

Abstract

Synthesizing novel views from a single image is highly ill-posed because unobserved areas admit multiple explanations. Most current methods generate unseen regions from ambiguous priors and interpolation near the input views, which often leads to severe distortions. To address this limitation, we propose UniView, a model that leverages reference images of a similar object to provide strong prior information during view synthesis. Specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. In addition, we introduce a plug-and-play adapter module with multi-level isolation layers to dynamically generate reference features for the target views. To preserve the details of the original input image, we design a decoupled triple attention mechanism that aligns and integrates multi-branch features into the synthesis process. Extensive experiments demonstrate that UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.
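The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how candidates are ranked or what the selection requirements are, so everything here is an assumption for illustration: the cosine-similarity ranking, the `database` layout, the `views` field, and the `judge` callable that stands in for the MLLM check are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(query_emb, database, k=3):
    """Return the k entries whose embeddings are most similar to the query."""
    ranked = sorted(
        database,
        key=lambda item: cosine(query_emb, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def select_reference(query_emb, database, judge, k=3):
    """Rank candidates, then keep the first one the judge accepts.

    `judge` is a placeholder for the MLLM-assisted check; it only needs
    to map a candidate to True or False.
    """
    for candidate in retrieve_candidates(query_emb, database, k):
        if judge(candidate):
            return candidate
    return None

# Toy database with made-up embeddings and a made-up "views" attribute.
database = [
    {"name": "chair_a", "embedding": [0.9, 0.1, 0.0], "views": 6},
    {"name": "chair_b", "embedding": [0.8, 0.2, 0.1], "views": 1},
    {"name": "car_a", "embedding": [0.0, 0.1, 0.9], "views": 6},
]

def enough_views(candidate):
    # Placeholder requirement: the reference must offer several views.
    return candidate["views"] >= 4

best = select_reference([1.0, 0.0, 0.0], database, enough_views, k=2)
print(best["name"] if best else None)
```

Splitting ranking from acceptance keeps the expensive judgment (an MLLM call in the paper's setting) limited to a short candidate list rather than the whole database.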

Paper Structure

This paper contains 14 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Motivation. Standard single-image NVS models (e.g., Zero123++) fail to synthesize occluded regions. UniView utilizes a reference image from a similar object to guide the synthesis, restoring correct geometry.
  • Figure 2: The architecture of UniView. A multimodal large language model (MLLM) retrieves the best-matching reference image from the database based on the input condition image. The resulting image pair (condition image and reference image) is processed by a Meta-Adapter, which combines a Base-Adapter and a Meta-Controller. Its output is then injected into a pre-trained multi-view diffusion model through a Decoupled Triple Attention mechanism. Zero convolution layers are inserted between the Base-Adapter and Meta-Controller modules, and before the Decoupled Triple Attention mechanism, to ensure effective isolation.
  • Figure 3: Qualitative results of UniView and the baseline.
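
The Decoupled Triple Attention and zero-initialized isolation described for Figure 2 can be illustrated with a small sketch. This excerpt does not give the paper's formulation, so the code below is one plausible reading, modeled on decoupled cross-attention as used in IP-Adapter: the same queries attend to three separate key/value sets (the base model's own features, condition-image features, and reference features) and the results are summed. The scalar gates `cond_gate` and `ref_gate` are a simplified, hypothetical stand-in for zero convolution layers, which are 1x1 convolutions initialized to zero; all function names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def add_scaled(a, b, scale):
    """Element-wise a + scale * b for two matrices of the same shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def decoupled_triple_attention(q, base_kv, cond_kv, ref_kv,
                               cond_gate=0.0, ref_gate=0.0):
    """Sum three attention branches that share the same queries.

    With both gates at zero (the initial state in this sketch), the
    output equals the base branch alone, so the pre-trained model's
    behavior is untouched until the gates are trained away from zero.
    """
    out = attention(q, *base_kv)
    out = add_scaled(out, attention(q, *cond_kv), cond_gate)
    out = add_scaled(out, attention(q, *ref_kv), ref_gate)
    return out

# Toy inputs: 2 query tokens with feature dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
base_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
cond_kv = ([[0.5, 0.5]], [[10.0, 10.0]])
ref_kv = ([[1.0, 1.0], [0.0, 2.0]], [[-1.0, 0.0], [0.0, -1.0]])

at_init = decoupled_triple_attention(q, base_kv, cond_kv, ref_kv)
after_training = decoupled_triple_attention(
    q, base_kv, cond_kv, ref_kv, cond_gate=0.5, ref_gate=0.5
)
print(at_init == attention(q, *base_kv))  # gates at zero: base branch only
```

Keeping the branches separate, rather than concatenating all keys and values into one attention call, means each extra branch can be switched off independently without renormalizing the base branch's attention weights.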