Painting 3D Nature in 2D: View Synthesis of Natural Scenes from a Single Semantic Mask
Shangzan Zhang, Sida Peng, Tianrun Chen, Linzhan Mou, Haotong Lin, Kaicheng Yu, Yiyi Liao, Xiaowei Zhou
TL;DR
This work tackles free-viewpoint rendering of natural scenes from a single semantic mask, a setting in which multi-view training data is scarce. It introduces a two-stage framework that first produces multi-view semantic masks via warping, inpainting, and a neural semantic field, then translates them to RGB images with SPADE while learning a neural scene representation to enforce view consistency. Semantic field fusion and surface-guided rendering enable photorealistic, multi-view-consistent results from a model trained only on single-view image collections, outperforming strong baselines in both quantitative metrics (FID/KID) and human assessments. The approach shows practical potential for creating coherent 3D natural-scene content by editing 2D semantic masks. Its main limitation is that it requires per-scene optimization; amortized inference is left to future work.
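To make the warping step concrete, here is a minimal sketch of forward-warping a semantic mask to a shifted viewpoint. It is not the authors' implementation: the per-pixel depth map, the pinhole camera with a pure horizontal translation, and the names `warp_semantic_mask`, `HOLE`, `focal`, and `tx` are all assumptions made for illustration. Pixels that no source pixel lands on are left as holes, which is the kind of region an inpainting step would then fill.

```python
import numpy as np

HOLE = -1  # hypothetical label for target pixels that receive no source pixel


def warp_semantic_mask(mask, depth, focal, tx):
    """Forward-warp an integer label mask to a camera translated by tx along x.

    Assumes a pinhole camera and a known per-pixel depth map (both are
    illustrative assumptions, not details taken from the paper).
    """
    h, w = mask.shape
    warped = np.full((h, w), HOLE, dtype=np.int64)
    zbuf = np.full((h, w), np.inf)  # nearest depth seen so far at each target pixel

    ys, xs = np.mgrid[0:h, 0:w]
    # Pure x-translation: each pixel shifts by disparity = focal * tx / depth.
    new_xs = np.rint(xs - focal * tx / depth).astype(np.int64)
    inside = (new_xs >= 0) & (new_xs < w)

    for y, x, nx in zip(ys[inside], xs[inside], new_xs[inside]):
        d = depth[y, x]
        if d < zbuf[y, nx]:  # z-buffer test: the closer surface wins
            zbuf[y, nx] = d
            warped[y, nx] = mask[y, x]
    return warped


# Tiny example: label 0 is far away (depth 10), label 1 is near (depth 1).
mask = np.array([[0, 0, 1, 1]])
depth = np.array([[10.0, 10.0, 1.0, 1.0]])
print(warp_semantic_mask(mask, depth, focal=1.0, tx=1.0))  # [[ 0  1  1 -1]]
```

In this toy example the near region shifts by one pixel and occludes part of the far region, while the rightmost pixel becomes a hole. Labels are copied rather than interpolated because blending class indices would produce meaningless values.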
Abstract
We introduce a novel approach that takes a single semantic mask as input and synthesizes multi-view consistent color images of natural scenes, trained on a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learn category-level priors for specific classes of objects, and so can hardly be applied to natural scenes. Our key idea for solving this challenging problem is to use a semantic field as the intermediate representation: it is easier to reconstruct from an input semantic mask, and it can then be translated to a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.
