
Free3D: Consistent Novel View Synthesis without 3D Representation

Chuanxia Zheng, Andrea Vedaldi

TL;DR

Free3D tackles open-set monocular novel view synthesis (NVS) by dispensing with explicit 3D representations and fine-tuning a pretrained 2D generator instead. It introduces a ray conditioning normalization (RCN) layer that encodes per-pixel viewing rays, together with a lightweight pseudo-3D cross-attention module and noise shared across views to enforce cross-view consistency. Trained on Objaverse, the method improves pose accuracy and multi-view coherence, and it generalizes to unseen datasets such as OmniObject3D and GSO without per-scene 3D optimization. The approach offers a practical, scalable baseline for 3D-free NVS with strong open-set generalization.

Abstract

We introduce Free3D, a simple, accurate method for monocular open-set novel view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization and fine-tune it for NVS. Compared to other works that take a similar approach, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming, and without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target camera pose is encoded in the network, which we do by introducing a new ray conditioning normalization (RCN) layer. The latter injects pose information into the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using lightweight multi-view attention layers and by sharing generation noise between the different views. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in new datasets, including OmniObject3D and GSO. The project page is available at https://chuanxiaz.com/free3d/.
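
To make the idea of "telling each pixel its viewing direction" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a pinhole camera looking along +z, encodes each ray as a 6-D Plücker-style vector (moment and direction), and uses the ray to scale and shift channel-normalized features in the style of adaptive normalization. The function names `pixel_ray` and `ray_modulate`, the weight matrices `w_scale` and `w_shift`, and the exact form of the modulation are all hypothetical; the abstract does not specify them.

```python
import math


def pixel_ray(u, v, width, height, focal, cam_to_world, cam_origin):
    """Return a 6-D ray embedding (o x d, d) for pixel (u, v).

    Assumption: pinhole camera looking along +z (x right, y down),
    with the principal point at the image centre.
    """
    # Direction through the pixel centre, expressed in the camera frame.
    d_cam = ((u + 0.5 - width / 2) / focal,
             (v + 0.5 - height / 2) / focal,
             1.0)
    # Rotate into world coordinates: d = R @ d_cam.
    d = [sum(cam_to_world[r][c] * d_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    o = cam_origin
    # Moment vector o x d; together with d it identifies the ray in space.
    moment = [o[1] * d[2] - o[2] * d[1],
              o[2] * d[0] - o[0] * d[2],
              o[0] * d[1] - o[1] * d[0]]
    return moment + d


def ray_modulate(features, ray, w_scale, w_shift, eps=1e-5):
    """Normalize one pixel's channels, then scale and shift them using the ray.

    Hypothetical form: w_scale and w_shift are (channels x 6) matrices that
    stand in for learned projections of the ray embedding.
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    normed = [(f - mean) / math.sqrt(var + eps) for f in features]
    out = []
    for c, x in enumerate(normed):
        scale = 1.0 + sum(w * r for w, r in zip(w_scale[c], ray))
        shift = sum(w * r for w, r in zip(w_shift[c], ray))
        out.append(x * scale + shift)
    return out
```

With all-zero weights the layer reduces to plain normalization, so the pose signal enters only through the projections of the ray embedding; this is one common way to start fine-tuning from a pretrained generator without disturbing it.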

Paper Structure (40 sections, 5 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Given a single input view, Free3D synthesizes consistent $360^\circ$ views accurately without using an explicit 3D representation. Trained on Objaverse only, it generalizes well to new datasets and categories.
  • Figure 2: The overall pipeline of Free3D. (a) Given a single input image, Free3D jointly predicts multiple target views instead of processing them independently. (b) We propose a novel ray conditioning normalization (RCN) layer, which uses a per-pixel oriented camera ray to modulate the latent features, enabling the model to capture viewpoints more precisely. (c) A memory-friendly pseudo-3D cross-attention module is introduced to efficiently bridge information across the multiple generated views.
  • Figure 3: Perceptual Path Length Consistency (PPLC). To partly compensate for the viewpoint change, the second image is rectified w.r.t. the first before comparison. To illustrate the importance of rectification, the figure shows two objects at a large azimuth, $\phi = 57.6^\circ$. In the top row, the left side shows an ideally rendered image pair, which nevertheless attains a large LPIPS score because of the view change; on the right, rectification reduces this score. The bottom row shows the opposite case, in which rectification increases the LPIPS score of a pair of incorrectly rendered views.
  • Figure 4: Qualitative comparisons on Objaverse. Given a target pose, Free3D significantly improves the accuracy of the generated pose compared to existing state-of-the-art methods. Note that Zero123-XL \cite{objaverseXL} is trained on the much larger Objaverse-XL dataset \cite{objaverseXL}, which contains 10 million 3D objects. More comparisons are provided in the supplement (Figs. \ref{fig:sota_obj_app1} and \ref{fig:sota_obj_app2}).
  • Figure 5: Qualitative comparisons on the OmniObject3D (top two rows) and GSO (bottom two rows) datasets. Interestingly, existing methods cannot deal with unconventional objects, such as the "pie" in the first row, while Free3D remains robust in such challenging scenarios. More comparisons are provided in the supplement (Figs. \ref{fig:oo3d_gso_app1} and \ref{fig:oo3d_gso_app2}).
  • ...and 6 more figures
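
The joint prediction in Figure 2(a) and the cross-view module in Figure 2(c) can be illustrated with a small sketch. It is not the authors' code. It assumes that "sharing generation noise" means every view starts from the same Gaussian sample, and that the pseudo-3D attention lets each view attend only to the tokens at the same spatial location in the other views; it also omits the learned query, key, and value projections. The names `shared_noise` and `cross_view_attention` are hypothetical.

```python
import math
import random


def shared_noise(num_views, size, seed=0):
    """Draw one Gaussian noise vector and give every view its own copy."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return [list(base) for _ in range(num_views)]


def cross_view_attention(tokens):
    """Attend across views at a single spatial location.

    tokens: one feature vector per view, all taken from the same pixel.
    Simplification: queries, keys, and values are the raw features.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product scores of this view against every view.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        # Numerically stable softmax over the view axis.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the per-view features.
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs
```

Under this assumption, the attention cost at each location grows with the number of views rather than with the number of pixels across all views, which is one plausible reading of "memory-friendly" in the caption.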