GroundUp: Rapid Sketch-Based 3D City Massing

Gizem Esra Unlu, Mohamed Sayed, Yulia Gryaditskaya, Gabriel Brostow

TL;DR

GroundUp introduces the first sketch-based tool for rapid 3D city massing. Designers iterate between a plan (top-down) sketch and a perspective sketch, and the system quickly infers a heightfield-based city model. The method combines a top-down occupancy mask predictor, a perspective-depth predictor conditioned on occupancy cues, and a multi-conditional latent diffusion model that completes coherent 3D geometry aligned with both sketches. Key contributions include (i) a novel perspective-depth network that leverages top-down scene cues, (ii) diffusion-based completion of the top-down heightfield conditioned on both sketches, and (iii) a tightly integrated, human-centered interface validated by a user study with architects and novices. The results show fast prototyping of initial massing with plausible, editable 3D outputs suitable for downstream architectural design; future work includes multi-view integration, editable meshes, and additional scene elements.

Abstract

We propose GroundUp, the first sketch-based ideation tool for 3D city massing of urban areas. We focus on early-stage urban design, where sketching is a common tool and design starts by balancing building volumes (masses) and open spaces. With Human-Centered AI in mind, we aim to help architects quickly revise their ideas by easily switching between 2D sketches and 3D models, allowing for smoother iteration and sharing of ideas. Inspired by feedback from architects and by existing workflows, our system takes as its first input a user sketch of multiple buildings in a top-down view. The user then draws a perspective sketch of the envisioned site. Our method is designed to exploit the complementary information in the two sketches and lets users quickly preview and adjust the inferred 3D shapes. Our model has two main components. First, we propose a novel sketch-to-depth prediction network for perspective sketches that exploits the building shapes in the top-down sketch. Second, we use depth cues derived from the perspective sketch to condition our diffusion model, which completes the geometry in a top-down view. Our final 3D geometry is thus represented as a heightfield, allowing users to construct the city "from the ground up".
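The abstract's final point, that the output geometry is a heightfield, can be made concrete with a minimal sketch. It assumes the heightfield is stored as a grid of building heights in metres (0 meaning open ground) and uses a hypothetical `cell_size` in metres per cell; `Box` and `heightfield_to_boxes` are illustrative names, not part of GroundUp. Each occupied cell is simply extruded into an axis-aligned box, which is one straightforward way to read the top-down output as massing volumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned building mass: footprint [x0, x1] x [y0, y1], extruded to `height`."""
    x0: float
    y0: float
    x1: float
    y1: float
    height: float

    @property
    def volume(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0) * self.height


def heightfield_to_boxes(heightfield, cell_size=1.0, min_height=0.0):
    """Extrude every cell whose height exceeds `min_height` into a Box.

    `heightfield` is a list of rows; the row index maps to y, the column index to x.
    """
    boxes = []
    for row, heights in enumerate(heightfield):
        for col, h in enumerate(heights):
            if h > min_height:
                boxes.append(Box(col * cell_size, row * cell_size,
                                 (col + 1) * cell_size, (row + 1) * cell_size, float(h)))
    return boxes


if __name__ == "__main__":
    hf = [[0, 10, 10],
          [0, 0, 25]]
    boxes = heightfield_to_boxes(hf, cell_size=2.0)
    print(len(boxes), sum(b.volume for b in boxes))  # 3 180.0
```

Because a heightfield stores a single height per cell, it cannot express overhangs or spaces beneath a roof; that is the trade-off of this 2.5D representation in exchange for a simple, grid-aligned output.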

Paper Structure (49 sections, 9 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: An illustrative example of our method. (0) Users of our web-based GroundUp system can optionally load registered maps, satellite images, or perspective photographs as underlay layers. These give context for the "massing" process. A blank underlay is used in this example. (1) (bottom) The user sketches the initial footprints of multiple buildings in a top-down view. These strokes are projected into the perspective-view canvas (top). (2) The user sketches a perspective view of the site, and then (3) they trigger our trained model to infer the 3D shape of the sketched buildings. The user can then refine their ideas, iterating between 2D sketching and 3D visuals.
  • Figure 2: Reconstruction pipeline overview. (I.) From the input sketches, (II.) we estimate the segmentation of the top-down sketch into individual buildings (detailed in the top-down mask section). (III.) We then inject volumetric information about the space not occupied by buildings (derived from the segmentation result and the known perspective camera of our interface) into the network that predicts depth and a foreground mask for the perspective sketch view (detailed in the perspective depth section). (IV.) From the predicted depth values, we obtain a partial 3D point cloud of the user-envisioned 3D city block. (V.) By projecting this sparse 3D prediction into a top-down view, we obtain an initial guess for the top-down heightfield. Finally, we rely on a diffusion model to obtain a plausible 3D reconstruction that aligns with the perspective and top-down sketches (shown in V and VI; detailed in the diffusion completion section).
  • Figure 3: Qualitative evaluation of the perspective depth estimation. Mono denotes a monocular depth prediction baseline by Sayed et al. [sayed22eccv]. OV denotes our model with the occupancy volume, constructed as described in the section on the depth network design. The grey mesh corresponds to the geometry obtained from the ground-truth heightfield. Point clouds show the depth values estimated from a perspective sketch, with colors encoding the distance from the camera. Our prediction aligns visually better with the ground truth.
  • Figure 4: Role of the normal loss. a) Visibility regions (red points) are computed from the ground-truth geometry and the perspective sketch viewpoint. b) Prediction with the normal loss: the red point cloud sits slightly above the green prediction. As view 2 shows, the height is slightly underestimated in the visible regions, but the loss yields more even roofs overall. c) Prediction without the normal loss: the model produces blobby building geometry outside the visible regions.
  • Figure 5: Qualitative evaluation on synthetic sketches. (a) and (b) show example top-down and perspective sketches. (c) and (f) show reconstruction results obtained with the HeightFields [watson2023wacv] method, which is trained and tested on the same data as our method. (d) and (g) show reconstruction results by our method. (e) and (h) show the heightfield of the ground-truth top-down depth map. Note that the colors are assigned according to the ground-truth segmentation of buildings. Please zoom in to better see the alignment of predicted geometries with the ground-truth buildings' areas.
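Figure 1 states that plan-view strokes are projected into the perspective canvas using the interface's known camera. A minimal pinhole-camera sketch of that step follows. It assumes the stroke points have already been converted to ground-plane coordinates (x, y) in metres with z = 0, a look-at camera with z pointing up, and pixel intrinsics `fx`, `fy`, `cx`, `cy`. All function names and parameters are hypothetical, not GroundUp's actual API.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def camera_basis(eye, target, up=(0.0, 0.0, 1.0)):
    """Right, image-down and forward unit axes of a camera at `eye` looking at `target`.

    Undefined when looking straight along `up` (forward parallel to up).
    """
    forward = _normalize(_sub(target, eye))
    right = _normalize(_cross(forward, up))
    down = _cross(forward, right)  # unit length: forward and right are orthonormal
    return right, down, forward


def project_ground_stroke(stroke_xy, eye, target, fx, fy, cx, cy):
    """Project ground-plane points (x, y, z=0) to perspective pixels (u, v).

    Points at or behind the camera plane have no valid projection and map to None.
    """
    right, down, forward = camera_basis(eye, target)
    pixels = []
    for x, y in stroke_xy:
        v = _sub((x, y, 0.0), eye)
        depth = _dot(v, forward)
        if depth <= 1e-9:
            pixels.append(None)
            continue
        pixels.append((cx + fx * _dot(v, right) / depth,
                       cy + fy * _dot(v, down) / depth))
    return pixels


if __name__ == "__main__":
    # Camera 10 m behind and 10 m above the origin, looking at it.
    px = project_ground_stroke([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, -30.0)],
                               eye=(0.0, -10.0, 10.0), target=(0.0, 0.0, 0.0),
                               fx=500.0, fy=500.0, cx=256.0, cy=256.0)
    print(px)
```

The look-at target lands on the principal point, ground points farther from the camera appear higher in the image, and points behind the camera are dropped rather than mirrored into the view.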
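Steps IV and V in Figure 2 lift the predicted perspective depth to a partial point cloud and project it into a top-down view as an initial heightfield guess. The sketch below covers only that geometry, under stated assumptions: depth is z-depth along the camera's forward axis, the camera basis (right, image-down, forward) and intrinsics are known, an optional foreground mask selects pixels, and when several points fall into one cell the highest one is kept. The max rule, the `None` marker for unobserved cells, and all names are illustrative choices rather than details from the paper; the diffusion-based completion (VI) is not sketched.

```python
import math


def unproject_depth(depth, fx, fy, cx, cy, eye, right, down, forward, mask=None):
    """Lift a z-depth image (list of rows) to world-space 3D points.

    Pixels with non-positive depth, or outside the optional foreground mask, are skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0 or (mask is not None and not mask[v][u]):
                continue
            xc = (u - cx) * z / fx  # camera-space offset along `right`
            yc = (v - cy) * z / fy  # camera-space offset along `down`
            points.append(tuple(eye[i] + xc * right[i] + yc * down[i] + z * forward[i]
                                for i in range(3)))
    return points


def splat_to_heightfield(points, grid_w, grid_h, cell_size, origin=(0.0, 0.0)):
    """Rasterize points into a top-down grid, keeping the highest z per cell.

    Cells that receive no point stay None: these are the gaps a completion model must fill.
    """
    hf = [[None] * grid_w for _ in range(grid_h)]
    for x, y, z in points:
        col = math.floor((x - origin[0]) / cell_size)
        row = math.floor((y - origin[1]) / cell_size)
        if 0 <= row < grid_h and 0 <= col < grid_w:
            if hf[row][col] is None or z > hf[row][col]:
                hf[row][col] = z
    return hf


if __name__ == "__main__":
    # Camera 10 m above the origin looking straight down; image up is world +y.
    eye, right, down, forward = (0.0, 0.0, 10.0), (1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0)
    depth = [[8.0, 10.0],
             [10.0, 10.0]]  # top-left pixel sees a 2 m high roof, the rest sees ground
    pts = unproject_depth(depth, 1.0, 1.0, 0.5, 0.5, eye, right, down, forward)
    print(splat_to_heightfield(pts, grid_w=3, grid_h=2, cell_size=6.0, origin=(-6.0, -6.0)))
    # [[0.0, 0.0, None], [2.0, 0.0, None]]
```

The `None` cells make the sparsity explicit: only what the perspective view actually observed is filled in, which is why a completion stage is needed afterwards.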
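Figure 2 (step III) and Figure 3 describe injecting volumetric information about the space not occupied by buildings into the depth network. The exact encoding is not given in this summary, so the following is only a hypothetical illustration. It assumes a binary footprint mask from the top-down segmentation and treats every column outside all footprints as known free space at every height. Marching a camera ray through that volume then approximates, to within one `step`, the distance before which the ray cannot stop: it must first reach a footprint column or the ground. The names, the grid convention (cell (row, col) covers x in [col, col + 1) and y in [row, row + 1), scaled by `cell_size`), and fixed-step marching are all assumptions.

```python
import math


def free_space_volume(footprint_mask, n_levels):
    """Voxels [level][row][col]: True where space is known to be empty.

    With only a top-down footprint, free space is identical at every height level.
    """
    return [[[not occupied for occupied in row] for row in footprint_mask]
            for _ in range(n_levels)]


def depth_lower_bound(eye, direction, footprint_mask, cell_size, max_dist, step):
    """March a unit-length ray from `eye`; return the first sampled distance at which
    it leaves known free space (enters a footprint column or reaches z <= 0)."""
    rows, cols = len(footprint_mask), len(footprint_mask[0])
    t = 0.0
    while t <= max_dist:
        x, y, z = (eye[i] + t * direction[i] for i in range(3))
        if z <= 0.0:
            return t  # reached the ground plane
        col, row = math.floor(x / cell_size), math.floor(y / cell_size)
        if 0 <= row < rows and 0 <= col < cols and footprint_mask[row][col]:
            return t  # entered a column that may contain a building
        t += step
    return max_dist


if __name__ == "__main__":
    mask = [[0, 0, 1, 0, 0]]  # one footprint cell covering x in [2, 3), y in [0, 1)
    vol = free_space_volume(mask, n_levels=4)
    print(vol[0][0])  # [True, True, False, True, True]
    print(depth_lower_bound((0.0, 0.5, 1.0), (1.0, 0.0, 0.0), mask, 1.0, 10.0, 0.5))  # 2.0
```

A horizontal ray at 1 m height passes through free columns and is stopped only by the footprint at x = 2, so any predicted depth shorter than that would contradict the plan sketch.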
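Figure 4 credits a normal loss with producing more even roofs. Its exact formulation is not given in this summary; a common choice, sketched here as an assumption, derives surface normals from the predicted and ground-truth heightfields by finite differences and penalizes one minus their cosine similarity. The function names and the optional per-cell `weights` grid are hypothetical.

```python
import math


def heightfield_normals(hf, cell_size=1.0):
    """Unit surface normals of a heightfield via central differences
    (one-sided differences at the borders; zero slope along a 1-cell axis)."""
    h, w = len(hf), len(hf[0])
    normals = []
    for r in range(h):
        row = []
        for c in range(w):
            c0, c1 = max(c - 1, 0), min(c + 1, w - 1)
            r0, r1 = max(r - 1, 0), min(r + 1, h - 1)
            dzdx = (hf[r][c1] - hf[r][c0]) / ((c1 - c0) * cell_size) if c1 != c0 else 0.0
            dzdy = (hf[r1][c] - hf[r0][c]) / ((r1 - r0) * cell_size) if r1 != r0 else 0.0
            nx, ny, nz = -dzdx, -dzdy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / norm, ny / norm, nz / norm))
        normals.append(row)
    return normals


def normal_loss(pred_hf, gt_hf, cell_size=1.0, weights=None):
    """Weighted mean of (1 - cosine similarity) between predicted and ground-truth normals."""
    pn = heightfield_normals(pred_hf, cell_size)
    gn = heightfield_normals(gt_hf, cell_size)
    total, count = 0.0, 0.0
    for r, row in enumerate(pn):
        for c, n in enumerate(row):
            wgt = 1.0 if weights is None else weights[r][c]
            cos = sum(a * b for a, b in zip(n, gn[r][c]))
            total += wgt * (1.0 - cos)
            count += wgt
    return total / count if count > 0 else 0.0


if __name__ == "__main__":
    flat = [[0.0, 0.0], [0.0, 0.0]]
    raised = [[5.0, 5.0], [5.0, 5.0]]
    sloped = [[0.0, 1.0], [0.0, 1.0]]
    print(normal_loss(raised, flat))  # 0.0: a uniform height offset is not penalized
    print(normal_loss(flat, sloped))  # about 0.29: wrong roof slope is penalized
```

Because normals depend only on height differences, this kind of loss rewards flat, consistent roof surfaces but ignores a uniform height offset, which is consistent with the caption's observation that heights can be slightly underestimated while roofs become more even.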