Infinite Leagues Under the Sea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models

Tianyi Zhang, Weiming Zhi, Joshua Mangelson, Matthew Johnson-Roberson

TL;DR

DreamSea tackles the challenge of photorealistic 3D underwater terrain generation from unannotated RGB imagery by conditioning a diffusion model on fractal latent embeddings derived from a Diamond-Square process and on zero-shot features from visual foundation models. It integrates depth inference via Depth Anything v2, RGBD generation, and 3D Gaussian Splatting (3DGS) with Score Distillation Sampling to render consistent novel views, producing large-scale, diverse underwater scenes. Key contributions include fractal latent terrain control, zero-shot feature conditioning with DINOv2 and PCA, and a diffusion-guided RGBD-to-3D pipeline that yields spatially coherent 3D terrains suitable for filming, gaming, and robot simulation. The approach demonstrates robust realism and consistency across multiple real-world datasets while outlining paths toward underwater simulation environments, albeit with limitations in metric scaling and viewing-angle diversity.

Abstract

This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit degraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model that produces hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain extensive real seafloor observations covering large areas, but are prone to real-world noise and artifacts. We extract 3D geometry and semantics from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors, which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in multiple domains, spanning filming, gaming, and robot simulation.
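
The step of fusing RGBD images into a 3D map can be illustrated with a small back-projection sketch. It assumes a standard pinhole camera model, which the abstract does not specify. The function name `backproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen only for illustration. Depth is treated as a per-pixel distance along the optical axis in arbitrary units.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth image (list of rows) to a list of (x, y, z) points.

    Assumes a pinhole camera; fx, fy, cx, cy are illustrative intrinsics,
    not values taken from the paper.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Skip pixels with missing or invalid depth.
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth image with one invalid pixel.
cloud = backproject([[1.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each returned point could then carry the RGB value of its source pixel. This is one plausible way to obtain a colored point cloud for initializing a 3DGS model, not necessarily the authors' procedure.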

Paper Structure

This paper contains 21 sections, 5 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Underwater 3D terrain generation: Given 2D images of the real-world seafloor collected by robots, DreamSea distills 3D geometry and semantic information from visual foundation models and trains a diffusion model that generates realistic 3D underwater scenes conditioned on latent embeddings from a fractal process. All images and maps shown above are synthesized with DreamSea.
  • Figure 2: Off-the-shelf solutions for generating underwater scenes: ChatGPT and SORA can generate scenes with diverse appearances, but show heavy artificial effects even when prompted with the "photorealistic style" keyword. Simulation environments [song2025oceansimgpuacceleratedunderwaterrobot] based on classic rendering pipelines, e.g. UNav-Sim [amer2023unav] and Infinigen [infinigen2023infinite], have limited ability to generate diverse and uncommon 3D assets.
  • Figure 3: Overview of Training: Given RGB-only images collected from underwater surveys, we generate depth channels and embeddings with visual foundation models [depth_anything_v2, oquab2024dinov2]. A DDPM network is then trained with an RGBD image as input, conditioned on the embeddings.
  • Figure 4: Overview of Generation: Our approach first generates fractal embeddings with the diamond-square method, then generates images conditioned on these embeddings. We use RePaint [repaint2022] to stitch the images together into a dense RGBD map. The RGBD map can be converted into a 3D point cloud and used to initialize a 3DGS model [kerbl3Dgaussians]. The 3DGS model is further refined with 2D diffusion priors using an SDS loss, allowing realistic rendering from novel views.
  • Figure 5: The Diamond-Square algorithm, which recursively interpolates on a spatial grid, is used to generate latent embeddings in our approach. The red arrows start from the vertices of the existing square and diamond shapes from the previous iteration, and point towards the new center points.
  • ...and 9 more figures
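
The Diamond-Square recursion described in the Figure 5 caption can be sketched in a few lines. This is a minimal scalar version of the classic algorithm, not the authors' implementation. The paper applies the process to latent embeddings, whereas this sketch fills a single-channel grid. The function name `diamond_square`, the `roughness` decay factor, the uniform noise, and the `seed` argument are all assumptions made for illustration.

```python
import random


def diamond_square(n, roughness=0.5, seed=0):
    """Return a (2**n + 1) x (2**n + 1) grid filled by Diamond-Square."""
    rng = random.Random(seed)
    size = 2 ** n + 1
    grid = [[0.0] * size for _ in range(size)]

    # Seed the four corners with random values.
    for r in (0, size - 1):
        for c in (0, size - 1):
            grid[r][c] = rng.uniform(-1.0, 1.0)

    step = size - 1
    scale = 1.0
    while step > 1:
        half = step // 2

        # Diamond step: each square's center is the mean of its four
        # corners plus noise.
        for r in range(half, size, step):
            for c in range(half, size, step):
                mean = (grid[r - half][c - half] + grid[r - half][c + half]
                        + grid[r + half][c - half] + grid[r + half][c + half]) / 4.0
                grid[r][c] = mean + rng.uniform(-scale, scale)

        # Square step: each diamond's center is the mean of its in-bounds
        # neighbors (three on the border, four elsewhere) plus noise.
        for r in range(0, size, half):
            start = half if (r // half) % 2 == 0 else 0
            for c in range(start, size, step):
                total, count = 0.0, 0
                for dr, dc in ((-half, 0), (half, 0), (0, -half), (0, half)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < size and 0 <= cc < size:
                        total += grid[rr][cc]
                        count += 1
                grid[r][c] = total / count + rng.uniform(-scale, scale)

        # Halve the spacing and shrink the noise for the next, finer level.
        step = half
        scale *= roughness
    return grid


terrain = diamond_square(3, seed=1)
```

Smaller `roughness` values make the noise decay faster across levels and give smoother grids. A vector-valued variant would run the same recursion independently per embedding channel, which is one possible reading of the caption rather than a confirmed detail.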