Painting 3D Nature in 2D: View Synthesis of Natural Scenes from a Single Semantic Mask

Shangzan Zhang, Sida Peng, Tianrun Chen, Linzhan Mou, Haotong Lin, Kaicheng Yu, Yiyi Liao, Xiaowei Zhou

TL;DR

This work tackles the problem of free-viewpoint rendering of natural scenes from a single semantic mask, addressing the scarcity of multi-view data. It introduces a two-stage framework that first produces multi-view semantic masks via warping, inpainting, and a neural semantic field, then translates them to RGB images with SPADE while learning a neural scene representation to enforce view consistency. The semantic field fusion and surface-guided rendering enable photorealistic, multi-view-consistent results trained on single-view image collections, outperforming strong baselines in both quantitative metrics (FID/KID) and human assessments. The approach demonstrates practical potential for editing 2D semantic masks to create coherent 3D natural-scene content, with limitations including per-scene optimization and room for amortized inference in future work.

Abstract

We introduce a novel approach that takes a single semantic mask as input to synthesize multi-view consistent color images of natural scenes, trained with a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learn category-level priors for specific classes of objects, neither of which works well for natural scenes. Our key idea to solve this challenging problem is to use a semantic field as the intermediate representation, which is easier to reconstruct from an input semantic mask and can then be translated to a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.

Paper Structure (45 sections, 14 equations, 12 figures, 4 tables)

This paper contains 45 sections, 14 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Given only a single semantic map as input (first row), our approach optimizes neural fields for view synthesis of natural scenes. Photorealistic images can be rendered via neural fields (the last two rows).
  • Figure 2: Illustration of our pipeline. Left: Our pipeline can be divided into two steps: we first generate multi-view semantic masks with an inpainting network and then convert semantic masks to RGB images using SPADE. In order to denoise and fuse semantic information, a semantic field is learned for rendering multi-view consistent masks. Finally, a neural scene representation is optimized to fuse appearance information provided by SPADE, which enables view-consistent rendering. Right: Our semantic inpainting network and SPADE are trained on single-view image collections.
  • Figure 3: Training a semantic inpainting network. Our semantic inpainting network takes $\mathbf{S}_{i\rightarrow j\rightarrow i}$ as input and is trained to recover $\mathbf{S}_i$.
  • Figure 4: The effectiveness of the semantic field. The cropped patches show that a minor change in the semantic masks across viewpoints (the first and second columns are adjacent viewpoints) causes unwanted large changes in the RGB images generated by SPADE.
  • Figure 5: Qualitative comparisons on the LHQ dataset. We produce more realistic renderings than all baselines, as demonstrated in the supplementary video.
  • ...and 7 more figures
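
The Figure 3 caption says the inpainting network receives a round-trip-warped mask $\mathbf{S}_{i\rightarrow j\rightarrow i}$ and is trained to recover $\mathbf{S}_i$. The sketch below illustrates one way such a training input could be produced from a single mask. It is a toy illustration, not the authors' implementation: the per-pixel disparity map, the purely horizontal camera shift, the `HOLE` label, and the function names `forward_warp` and `round_trip` are all assumptions made for this example.

```python
import numpy as np

HOLE = -1  # hypothetical label marking pixels that received no source after warping


def forward_warp(labels, disparity, shift):
    """Splat each labeled pixel horizontally by round(disparity * shift).

    When several pixels land on the same target, the one with the largest
    disparity (the nearest surface) wins. Returns the warped labels and the
    warped disparity map.
    """
    h, w = labels.shape
    out_labels = np.full((h, w), HOLE, dtype=labels.dtype)
    out_disp = np.full((h, w), -np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    new_xs = xs + np.rint(disparity * shift).astype(int)
    valid = (labels != HOLE) & (new_xs >= 0) & (new_xs < w)
    for y, x, nx in zip(ys[valid], xs[valid], new_xs[valid]):
        d = disparity[y, x]
        if d > out_disp[y, nx]:
            out_disp[y, nx] = d
            out_labels[y, nx] = labels[y, x]
    out_disp[out_labels == HOLE] = 0.0  # holes carry no depth information
    return out_labels, out_disp


def round_trip(labels, disparity, shift):
    """Warp view i to an assumed view j and back, yielding S_{i->j->i}."""
    labels_j, disp_j = forward_warp(labels, disparity, shift)
    labels_back, _ = forward_warp(labels_j, disp_j, -shift)
    return labels_back


# One-row toy mask: a single foreground pixel (label 1) in front of background (label 0).
S_i = np.array([[0, 0, 1, 0, 0, 0]])
disp_i = np.array([[0.0, 0.0, 1.0, 0.0, 0.0, 0.0]])

S_iji = round_trip(S_i, disp_i, shift=1.0)
print(S_iji.tolist())  # [[0, 0, 1, -1, 0, 0]]
```

In this toy case the background pixel that the foreground occludes in view j comes back as a hole, so the pair (`S_iji`, `S_i`) has the form of input and target described in the caption, and it can be built from single-view data alone.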