MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

Xuyang Chen; Zhijun Zhai; Kaixuan Zhou; Zengmao Wang; Jianan He; Dong Wang; Yanfeng Zhang; Mingwei Sun; Rüdiger Westermann; Konrad Schindler; Liqiu Meng

TL;DR

MeSS is a geometry-anchored diffusion pipeline for generating outdoor scenes on city meshes, targeting both precise geometric alignment and cross-view consistency. It combines two Cascaded ControlNets that generate sparse key views, a Gaussian surfel-based 3D Gaussian field initialized on the mesh surface, Stage II densification via Appearance Guided Inpainting (AGInpaint) with Latent Consistency Model constraints, and Global Consistency Alignment (GCAlign) to harmonize exposure across views. The approach achieves higher geometric fidelity and view consistency than baselines, and it also supports stylized rendering through relighting and diffusion-guided style transfer. It offers scalable texture augmentation for mesh-based city scenes at low training cost, with applications in virtual navigation and autonomous driving simulation.

Abstract

Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques. Project page: https://albertchen98.github.io/mess/
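The abstract states that the 3DGS scene is reconstructed by initializing Gaussian balls on the mesh surface, but it does not say how the initial positions are chosen. As a minimal sketch of one possible scheme, the Python snippet below draws area-weighted random points on a triangle mesh and returns the face normal at each point. The function name sample_surface_points, its parameters, and the sampling strategy itself are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def sample_surface_points(vertices, faces, n_points, seed=0):
    """Sample points uniformly on a triangle mesh (hypothetical helper).

    Illustrative assumption only: the paper's abstract does not specify
    how Gaussian centers are placed on the mesh surface.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]                      # (F, 3, 3) triangle corners
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    cross = np.cross(b - a, c - a)             # unnormalized face normals
    areas = 0.5 * np.linalg.norm(cross, axis=1)

    # Pick faces with probability proportional to area, so the point
    # density is uniform over the surface. Zero-area faces are never picked.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())

    # Uniform barycentric coordinates via the square-root trick.
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    weights = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)

    points = np.einsum("nk,nkd->nd", weights, tri[face_idx])
    picked = cross[face_idx]
    normals = picked / np.linalg.norm(picked, axis=1, keepdims=True)
    return points, normals


# Toy mesh: a unit square in the z = 0 plane, split into two triangles.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])
points, normals = sample_surface_points(vertices, faces, n_points=1000)
```

In a surfel-style setup, the sampled points could serve as initial Gaussian centers and the normals could orient flattened Gaussians along the surface; whether MeSS does this is not stated in the abstract.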
