Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)

Bin Han; Yiwei Yang; Anat Caspi; Bill Howe

TL;DR

The paper addresses the high cost of obtaining comprehensive digital representations of the built environment for equitable urban mobility. It introduces a zero-shot annotation pipeline that combines a general segmentation model (SAM) with a vision-language model (GPT-4o) using Set-of-Mark prompting to produce bounding-box annotations from satellite imagery without fine-tuning. Direct prompting annotates almost no images correctly, while segmented prompting reaches an intersection-over-union (IoU) of roughly 0.24–0.42 on the two features tested, stop lines and raised tables. These results demonstrate feasibility and motivate a broader research agenda. The work highlights the potential for scalable, low-cost urban feature annotation to support equity, accessibility, and safety, and outlines challenges in segmentation reliability, VLM behavior, and cross-city generalization that warrant further development.

Abstract

Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision-language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, and their performance in these settings is therefore unclear. We demonstrate a proof of concept that combines a state-of-the-art vision-language model with variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features -- stop lines and raised tables -- show that while direct zero-shot prompting correctly annotates nearly zero images, the pre-segmentation strategies can annotate images with nearly 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.
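
Since the results above are reported as intersection-over-union, a minimal sketch of that metric for axis-aligned bounding boxes may help. The box layout `(x_min, y_min, x_max, y_max)` and the function name `box_iou` are assumptions made for illustration; the paper's own evaluation code is not shown in this summary and may differ.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs are treated as no overlap.
    return inter / union if union > 0 else 0.0
```

Under this definition, a predicted box that covers the ground-truth box exactly scores 1.0 and a disjoint box scores 0.0, so a mean IoU near 0.4 indicates substantial but partial overlap on average.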

Paper Structure (5 sections, 3 figures, 2 tables)

This paper contains 5 sections, 3 figures, 2 tables.

Introduction
Related Work
Prompting Procedure
Evaluation & Results
Challenges & Future Direction

Figures (3)

Figure 1: Pipeline of our proposed automated annotation process. Users input a pair of (satellite image, annotation guidance). The image goes through a set of processes including segmentation, filtering, and set-of-mark generation. The image and guidance are then passed to a vision-language model, the output of which is post-processed to produce the final annotation results. The procedure requires no fine-tuning and can be applied to different features with minimal adjustment to the guidance.
Figure 2: Set-of-Mark (SoM) generation scenarios. (a) Filtered candidates. (b) No-Context: Candidate objects are presented separately in a new image. (c) In-Context: Candidate objects are labeled with numbers and bounding boxes within the original image.
Figure 3: Left -- Examples of annotated stop lines. Right -- Examples of annotated raised tables. Red regions in each image are the segmented objects. Green and yellow outlines indicate perfect and approximate annotations, respectively. Red outlines indicate inaccurate annotations.
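
The captions above describe a flow of filtering segmented candidates, marking them with numbers, prompting the model, and post-processing its answer. The sketch below illustrates only the bookkeeping around that flow. Everything in it is a labeled assumption: the area-based filter, the prompt wording, the reply format, and the names `filter_and_mark`, `build_prompt`, and `parse_reply` are hypothetical and not taken from the paper. Segmentation, rendering the marks onto the image, and the model call itself are omitted.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Candidate:
    mark: int  # numeric label that would be drawn on the image
    box: Box   # bounding box of one segmented region


def filter_and_mark(boxes: List[Box], min_area: int, max_area: int) -> List[Candidate]:
    """Hypothetical filter: keep boxes within an area range, number them from 1."""
    kept = [
        b for b in boxes
        if min_area <= (b[2] - b[0]) * (b[3] - b[1]) <= max_area
    ]
    return [Candidate(mark=i, box=b) for i, b in enumerate(kept, start=1)]


def build_prompt(guidance: str, candidates: List[Candidate]) -> str:
    """Combine the annotation guidance with the list of visible marks."""
    marks = ", ".join(str(c.mark) for c in candidates)
    return (
        f"{guidance}\n"
        f"The image shows candidate regions labeled {marks}.\n"
        "Reply with the labels of the matching regions, or 'none'."
    )


def parse_reply(reply: str, candidates: List[Candidate]) -> List[Box]:
    """Post-process a free-text reply into boxes, ignoring unknown labels."""
    by_mark = {c.mark: c.box for c in candidates}
    picked: List[Box] = []
    for token in re.findall(r"\d+", reply):
        mark = int(token)
        if mark in by_mark and by_mark[mark] not in picked:
            picked.append(by_mark[mark])
    return picked
```

Because the marks map back to segment boxes, the model only has to name labels rather than produce coordinates, which is the property Set-of-Mark prompting relies on. Ignoring labels that were never shown guards against replies that mention marks outside the candidate set.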
