Scene-Aware Urban Design: A Human-AI Recommendation Framework Using Co-Occurrence Embeddings and Vision-Language Models

Rodrigo Gallardo, Oz Fishman, Alexander Htet Kyaw

TL;DR

This work addresses the challenge of incorporating residents' everyday knowledge into urban design by proposing a human-in-the-loop framework that surfaces micro-scale interventions anchored to observable objects. It combines three components: lightweight object detection on ADE20K (via Grounding DINO), co-occurrence embeddings that capture common spatial configurations, and a vision-language model that reasons over scenes and proposes additional contextually relevant objects, with 3D mesh previews for AR integration. The key contributions include a co-occurrence-based contextual embedding, a two-branch approach combining statistical co-occurrence with semantic VLM reasoning, and an end-to-end AR-enabled workflow for interactive design exploration. By grounding AI suggestions in real-world patterns and user input, the framework aims to democratize urban design and extend participatory decision-making, while acknowledging limitations in 2D spatial reasoning and detection noise.

Abstract

This paper introduces a human-in-the-loop computer vision framework that uses generative AI to propose micro-scale design interventions in public space and support more continuous, local participation. Using Grounding DINO and a curated subset of the ADE20K dataset as a proxy for the urban built environment, the system detects urban objects and builds co-occurrence embeddings that reveal common spatial configurations. From this analysis, the user receives five statistically likely complements to a chosen anchor object. A vision-language model then reasons over the scene image and the selected pair to suggest a third object that completes a more complex urban tactic. The workflow keeps people in control of selection and refinement and aims to move beyond top-down master planning by grounding choices in everyday patterns and lived experience.
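
To make the co-occurrence step concrete, the following is a minimal sketch of how per-image object labels could be turned into pair counts and then into a ranked list of complements for an anchor object. It is an illustration under stated assumptions, not the authors' implementation: it ranks by raw co-occurrence counts rather than the learned embeddings the paper describes, and the function names (`build_cooccurrence`, `top_complements`) and the toy scene labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(scenes):
    """Count, for each unordered label pair, how many scenes contain both.

    `scenes` is a list of label lists, one per image (hypothetical input;
    in the paper these labels would come from the detector).
    """
    pair_counts = Counter()
    for labels in scenes:
        # Deduplicate within a scene so repeated detections count once.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def top_complements(pair_counts, anchor, k=5):
    """Return up to k labels that most often share a scene with `anchor`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == anchor:
            scores[b] += n
        elif b == anchor:
            scores[a] += n
    # Sort by count (descending), then by label so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ranked[:k]]


# Toy data standing in for detections on a handful of images (assumed).
scenes = [
    ["bench", "tree", "streetlight"],
    ["bench", "trash can", "tree"],
    ["bench", "tree", "bicycle"],
    ["streetlight", "sidewalk"],
]
counts = build_cooccurrence(scenes)
print(top_complements(counts, "bench"))
# → ['tree', 'bicycle', 'streetlight', 'trash can']
```

With k=5, the function mirrors the "five statistically likely complements" described above; here only four labels co-occur with the anchor, so four are returned. In the described workflow, the user would pick one of these, and the anchor-complement pair plus the scene image would then go to the vision-language model.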

Paper Structure

This paper contains 16 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Overview Diagram
  • Figure 2: ADE20K image with listed objects and their parts
  • Figure 3: 900 Image Co-Occurrence Matrix
  • Figure 4: Pilot Interface in Urban Scene
  • Figure 5: Mesh generated using VLM-generated recommendation and description
  • ...and 6 more figures