MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

Xuyang Chen, Zhijun Zhai, Kaixuan Zhou, Zengmao Wang, Jianan He, Dong Wang, Yanfeng Zhang, Mingwei Sun, Rüdiger Westermann, Konrad Schindler, Liqiu Meng

TL;DR

MeSS introduces a geometry-anchored diffusion pipeline for outdoor scene generation on city meshes, targeting both precise geometric alignment and cross-view consistency. It combines three components: two cascaded ControlNets that generate sparse key views; a Gaussian field built by projecting the generated pixels onto the mesh surface as 2D Gaussian surfels; and a Stage II densification step that uses Appearance-Guided Inpainting (AGInpaint) with Latent Consistency Model constraints, followed by Global Consistency Alignment (GCAlign) to harmonize exposures. The approach yields higher geometric fidelity and view consistency than baselines and additionally supports stylized rendering through relighting and diffusion-guided style transfer. It offers scalable texture augmentation for mesh-based city scenes at low training cost, with applications in virtual navigation and autonomous driving simulation.

Abstract

Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques. Project page: https://albertchen98.github.io/mess/
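
To make the idea of initializing Gaussians on the mesh surface more concrete, the following is a minimal sketch of one common way to place primitives on a triangle mesh: pick faces with probability proportional to their area, then draw a uniform point inside each chosen triangle. This is an illustration under stated assumptions, not the authors' implementation. The function name `sample_surfel_centers`, the area-weighted sampling strategy, and the use of the face normal as the surfel orientation are all hypothetical choices that the abstract does not specify.

```python
import math
import random


def sample_surfel_centers(vertices, faces, n_points, seed=0):
    """Sample (position, normal) pairs uniformly over a triangle mesh surface.

    Hypothetical helper: the abstract only says Gaussians are initialized on
    the mesh surface; area-weighted sampling is an assumption made here.
    """
    rng = random.Random(seed)
    normals, areas = [], []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        e1 = [b[d] - a[d] for d in range(3)]
        e2 = [c[d] - a[d] for d in range(3)]
        # The cross product gives the face normal; its length is twice the area.
        n = [e1[1] * e2[2] - e1[2] * e2[1],
             e1[2] * e2[0] - e1[0] * e2[2],
             e1[0] * e2[1] - e1[1] * e2[0]]
        length = math.sqrt(sum(x * x for x in n))
        areas.append(0.5 * length)
        normals.append([x / length for x in n] if length > 0 else [0.0, 0.0, 0.0])

    # Larger faces receive proportionally more samples.
    chosen = rng.choices(range(len(faces)), weights=areas, k=n_points)

    surfels = []
    for f in chosen:
        a, b, c = (vertices[v] for v in faces[f])
        # The square-root trick yields uniformly distributed barycentric weights.
        s = math.sqrt(rng.random())
        t = rng.random()
        w = (1 - s, s * (1 - t), s * t)
        p = tuple(w[0] * a[d] + w[1] * b[d] + w[2] * c[d] for d in range(3))
        surfels.append((p, tuple(normals[f])))
    return surfels


# Usage: a unit square in the z = 0 plane, split into two triangles.
square_vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
square_faces = [(0, 1, 2), (0, 2, 3)]
surfels = sample_surfel_centers(square_vertices, square_faces, n_points=500)
```

Each returned pair could serve as the center and orientation of one flat Gaussian; scale, opacity, and color would still have to come from the generated views.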

Paper Structure

This paper contains 30 sections, 6 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Starting from textureless urban meshes, our MeSS synthesizes high-quality Gaussian Splatting Scenes with realistic appearance. After synthesis, these Gaussian scenes can be further rendered into stylized videos.
  • Figure 2: Schematic illustration of MeSS. Given a sequence of camera poses, we start by generating the last viewpoint using a ControlNet-s. Then we generate other key views in reverse order using a ControlNet-n, while transferring information backwards through the sequence. All generated pixels are projected onto the mesh surface as 2D Gaussian surfels. From the resulting Gaussian field, intermediate views are rendered and filled up with Appearance-Guided Inpainting (AGInpaint), simultaneously densifying the Gaussian field. Each time the field is extended, a Global Consistency Alignment ensures spatial consistency by simultaneously denoising multi-view renderings.
  • Figure 3: Silhouettes on novel views, marked with red ellipses and arrows.
  • Figure 4: a) Comparison of Resample (left) vs. AGInpaint (right); AGInpaint performs better than Resample when inpainting slim regions. b) Comparison of results without (left) and with (right) GCAlign; GCAlign harmonizes the seams caused by differing exposures.
  • Figure 5: Visual comparison with other methods. Since there is no code provided by CityDreamer4D and Streetscapes, we take the visual results from their papers. Please zoom in to check for details.
  • ...and 3 more figures
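
The view ordering described in the Figure 2 caption can be sketched as a small scheduling routine: the last viewpoint is generated first with ControlNet-s, the remaining key views follow in reverse order with ControlNet-n, and intermediate views are filled in afterwards. This is only an illustrative sketch. The function names, the fixed `key_stride` spacing between key views, and the ascending order of intermediate views are assumptions, since the caption does not say how key views are selected or in what order intermediate views are inpainted.

```python
def generation_schedule(n_views, key_stride):
    """Return (key_order, intermediate_order) as lists of view indices.

    Assumption: key views are spaced `key_stride` apart, counted backwards
    from the last view so that the last view is always a key view.
    """
    if n_views < 1 or key_stride < 1:
        raise ValueError("n_views and key_stride must be positive")
    last = n_views - 1
    # Key views in reverse order, starting from the last camera pose.
    key_order = list(range(last, -1, -key_stride))
    key_set = set(key_order)
    # Remaining views are handled later by inpainting (order is an assumption).
    intermediate_order = [i for i in range(n_views) if i not in key_set]
    return key_order, intermediate_order


def assign_models(key_order):
    """Pair each key view with the model named in the Figure 2 caption."""
    return [
        ("ControlNet-s" if rank == 0 else "ControlNet-n", view)
        for rank, view in enumerate(key_order)
    ]


# Usage: ten camera poses with every fourth view treated as a key view.
key_order, intermediate_order = generation_schedule(n_views=10, key_stride=4)
plan = assign_models(key_order)
```

Separating the schedule from the model calls keeps the ordering logic easy to inspect before any diffusion model is involved.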