Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark

Rashid Mushkani

TL;DR

This work introduces a community-grounded urban-perception benchmark to evaluate vision-language systems on 100 Montreal street images across 30 dimensions, contrasting objective visual attributes with subjective appraisals. A deterministic zero-shot evaluation harness and French–English normalization enable reproducible scoring (accuracy for single-label and Jaccard for multi-label items) across seven diverse systems, with human reliability measured via Krippendorff's α and pairwise Jaccard. Results show stronger model alignment for visually grounded properties than for subjective judgments; model scores co-vary with human agreement, and there is only a modest gap between real and synthetic scenes. The study highlights practical avenues for uncertainty-aware, participatory urban analysis and provides a scalable, transparent framework for future research on human–AI alignment in urban perception.

Abstract

Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions that mix physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and a deterministic parser. We score single-choice items with accuracy and multi-label items with Jaccard overlap; human agreement is measured with Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than on subjective appraisals. The top system (claude-sonnet) reaches a macro-averaged score of 0.31 and a mean Jaccard of 0.48 on multi-label items. Higher human agreement coincides with better model scores. Scores are slightly lower on synthetic images. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
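The scoring scheme named in the abstract can be illustrated with a minimal sketch. It is not the released harness: the function names, the toy labels, and the convention that two empty label sets count as full agreement are all assumptions made here for illustration. Krippendorff's alpha is omitted.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two label sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty selections agree fully
    return len(a & b) / len(a | b)


def accuracy(predictions, consensus):
    """Share of images whose single-choice prediction equals the consensus."""
    return mean(1.0 if p == c else 0.0 for p, c in zip(predictions, consensus))


def mean_jaccard(predictions, consensus):
    """Mean Jaccard overlap over images for one multi-label dimension."""
    return mean(jaccard(p, c) for p, c in zip(predictions, consensus))


def pairwise_jaccard(annotations):
    """Mean Jaccard over all annotator pairs for one image (human agreement)."""
    return mean(jaccard(x, y) for x, y in combinations(annotations, 2))


def macro_average(per_dimension_scores):
    """Unweighted mean of per-dimension scores (accuracy or Jaccard)."""
    return mean(per_dimension_scores.values())


# Hypothetical toy data: one single-choice and one multi-label dimension.
single_score = accuracy(["high", "low", "medium", "low"],
                        ["high", "low", "low", "low"])
multi_score = mean_jaccard([{"trees", "benches"}, {"cars"}],
                           [{"trees"}, {"cars", "bikes"}])
macro_score = macro_average({"dim_single": single_score,
                             "dim_multi": multi_score})
human_agreement = pairwise_jaccard([{"trees"}, {"trees", "benches"}, {"benches"}])
```

Macro-averaging weights every dimension equally regardless of item type, so a single hard subjective dimension pulls the overall score down as much as an easy objective one raises it.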

Paper Structure

This paper contains 37 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Self-reported participant context (counts). These categories were optional and are reported only in aggregate to characterize the annotator pool. Categories are not mutually exclusive and reflect intersectional identities; participants could select multiple identity markers, so counts exceed the number of participants.
  • Figure 2: Overall agreement with human consensus by model. Macro-averaged accuracy (single-choice) and Jaccard (multi-label) across 30 dimensions.
  • Figure 3: Mean Jaccard set overlap between model selections and the human consensus for the multi-label dimensions.
  • Figure 4: Difficulty by dimension. Each point is the mean model score for the dimension (accuracy or Jaccard depending on the item type).
  • Figure 5: Agreement by dimension and model. Warmer colors indicate higher agreement with human consensus (accuracy for single-choice, Jaccard for multi-label).
  • ...and 5 more figures