MESA: Text-Driven Terrain Generation Using Latent Diffusion and Global Copernicus Data

Paul Borne--Pons, Mikolaj Czerkawski, Rosalie Martin, Romain Rouffet

TL;DR

MESA introduces a data-centric approach to terrain generation: a latent diffusion model is trained on global remote sensing data to produce 2.5D terrains (RGB and depth) from text prompts. The method combines a Stable Diffusion 2.1 backbone with a joint RGB-DEM representation. Captions derived from geographic coordinates condition the generation, and a masked loss yields cloud-free outputs. A major contribution is the Major TOM Core-DEM dataset extension and its open release, which provides global, dense terrain coverage for training. The work presents qualitative evidence of diverse, realistic terrain synthesis and highlights the potential of data-driven generation trained on remote sensing data for scalable terrain modeling in games and simulations, along with practical considerations such as shadow correction.

Abstract

Terrain modeling has traditionally relied on procedural techniques, which often require extensive domain expertise and handcrafted rules. In this paper, we present MESA, a novel data-centric alternative that trains a diffusion model on global remote sensing data. This approach leverages large-scale geospatial information to generate high-quality terrain samples from text descriptions, offering a flexible and scalable solution for terrain generation. The model's capabilities are demonstrated through extensive experiments that highlight its ability to generate realistic and diverse terrain landscapes. The dataset produced to support this work, the Major TOM Core-DEM extension dataset, is released openly as a comprehensive resource for global terrain data. The results suggest that data-driven models trained on remote sensing data can provide a powerful tool for realistic terrain modeling and generation.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: MESA is a novel generative model based on latent denoising diffusion that generates 2.5D representations of terrain conditioned on natural-language text prompts. The model produces two co-registered modalities: an optical image and a depth map.
  • Figure 2: Coverage of the dataset used for this work. Every pixel corresponds to a single cell on the Major TOM grid (10 km). Green marks regions with only Sentinel-2 images available, while blue indicates those with only DEM. Black indicates the absence of any data, while the land and water colors represent the presence of both modalities.
  • Figure 3: Using Stable Diffusion 2.1 weights, we project RGB and depth maps into latent space with frozen VAE encoders. Latents are noised and denoised conditionally on captions via a modified U-Net. We mask the loss with $\mathbf{z}_M$ to focus on cloud-free pixels, enabling cloud-free terrain generation.
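
The loss masking described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the mask is binary (1 for cloud-free latent positions, 0 for cloudy ones) and that the loss is a squared error between predicted and true noise, averaged over the valid positions only. The function name `masked_mse`, the flattened-list layout, and the normalization by the number of valid positions are illustrative assumptions, not the paper's implementation.

```python
def masked_mse(noise_pred, noise_true, mask):
    """Squared error averaged over positions where mask == 1.

    Hypothetical helper: inputs are flattened latents given as equal-length
    sequences of floats; mask holds 1 (cloud-free) or 0 (cloudy).
    """
    if not (len(noise_pred) == len(noise_true) == len(mask)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    valid = 0
    for pred, true, m in zip(noise_pred, noise_true, mask):
        # Cloudy positions (m == 0) contribute nothing to the loss.
        total += m * (pred - true) ** 2
        valid += m

    # Assumption: a fully masked sample yields zero loss rather than an error.
    return total / valid if valid > 0 else 0.0


# Toy example: the third position has a large error but is masked out.
noise_true = [0.5, -1.0, 0.25, 2.0]
noise_pred = [0.5, -1.0, 9.0, 1.0]
mask = [1, 1, 0, 1]
loss = masked_mse(noise_pred, noise_true, mask)
```

In this toy case only the last position contributes, so the cloudy position cannot pull the model toward reproducing clouds, which is the intent the caption describes.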