UrbanVGGT: Scalable Sidewalk Width Estimation from Street View Images

Kaizhen Tan, Fan Zhang

Abstract

Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large-scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high-resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street-view image. The method combines semantic segmentation, feed-forward 3D reconstruction, adaptive ground-plane fitting, camera-height-based scale calibration, and directional width measurement on the recovered plane. On a ground-truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV-SideWidth, a prototype sidewalk-width dataset covering 527 OpenStreetMap street segments. The results indicate that street-view imagery can support scalable generation of candidate sidewalk-width attributes, while broader cross-city validation and local ground-truth auditing remain necessary before deployment as authoritative planning data.
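The geometric core of the pipeline described above (ground-plane fitting, camera-height-based scale calibration, and width measurement on the recovered plane) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an up-to-scale point cloud with the camera at the origin, uses a plain least-squares plane fit in place of the adaptive fitting, and measures a single inner/outer boundary pair instead of the directional measurement. The function names, the synthetic points, and the 2.5 m camera height are all hypothetical.

```python
import numpy as np

def fit_ground_plane(points):
    """Least-squares plane through (N, 3) points.

    Returns a unit normal n and offset d such that n @ x + d = 0 on the plane.
    """
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    return normal, -float(normal @ centroid)

def metric_scale(d, cam_height_m):
    """Factor converting reconstruction units to metres.

    Assumes the camera sits at the origin, so |d| is its distance to the plane.
    """
    return cam_height_m / abs(d)

def width_on_plane(inner, outer, normal, d, scale):
    """Metric distance between two boundary points after projecting onto the plane."""
    def project(p):
        return p - (normal @ p + d) * normal
    return scale * float(np.linalg.norm(project(inner) - project(outer)))

# Synthetic, hypothetical data: a flat ground 1.25 reconstruction units below the camera.
xs, zs = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(1, 6, 6))
ground = np.column_stack([xs.ravel(), np.full(xs.size, 1.25), zs.ravel()])

normal, d = fit_ground_plane(ground)
scale = metric_scale(d, cam_height_m=2.5)  # assumed mounting height, not from the paper
inner = np.array([0.0, 1.25, 5.0])  # boundary point on one sidewalk edge
outer = np.array([1.0, 1.25, 5.0])  # paired point on the other edge
width_m = width_on_plane(inner, outer, normal, d, scale)
print(f"scale = {scale:.2f}, width = {width_m:.2f} m")
```

In this sketch the estimated width is directly proportional to the assumed camera height, so any relative error in that height carries over to the width unchanged.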

Paper Structure (32 sections, 4 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Sidewalk width data in OpenStreetMap. Grey lines denote all drivable streets; blue lines (if any) denote streets with a sidewalk-width tag. Both New York City (461 streets) and Nairobi (1958 streets) have zero sidewalk-width tags, highlighting the data gap that UrbanVGGT aims to fill.
  • Figure 2: UrbanVGGT pipeline overview. (a) Input street-view image. (b) Semantic segmentation with inner (yellow) and outer (red) boundary detection. (c) Midline overlap region used to pair boundary points. (d) VGGT-based 3D reconstruction. (e) Ground-plane fitting with semantic point cloud. (f) Width estimation and preliminary dataset construction.
  • Figure 3: Qualitative measurement examples on the D.C. dataset. Each panel shows the segmentation overlay with detected inner (yellow) and outer (red) boundaries. Predicted width: model estimate; ground-truth width: reference measurement.
  • Figure 4: Camera height sensitivity: MAE as a function of the assumed camera mounting height $h_{\mathrm{cam}}$.
  • Figure 5: MAE comparison across all methods. Models are grouped by evaluation category: Category 1 (metric depth with native scale), Category 2 (monocular depth with pinhole unprojection and scale calibration), and Category 3 (single-image point-cloud reconstruction with scale calibration). All methods share the same segmentation, boundary extraction, plane fitting, and outlier filtering; only the 3D geometry backbone differs. The dashed red line indicates the UrbanVGGT MAE (0.252 m).
  • ...and 1 more figure
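
The two accuracy figures quoted in the abstract and captions, the mean absolute error (MAE) and the share of estimates within 0.50 m of the reference width, are straightforward to compute from paired predictions and ground-truth values. The sketch below shows one way to do it; the function name and sample values are hypothetical, and treating the 0.50 m tolerance as inclusive is an assumption, since the source does not say.

```python
import numpy as np

def width_error_summary(pred_m, ref_m, tol_m=0.50):
    """Return (MAE in metres, fraction of estimates with |error| <= tol_m)."""
    pred = np.asarray(pred_m, dtype=float)
    ref = np.asarray(ref_m, dtype=float)
    abs_err = np.abs(pred - ref)
    return float(abs_err.mean()), float((abs_err <= tol_m).mean())

# Made-up widths for illustration only; these are not values from the benchmark.
mae, within = width_error_summary([1.0, 2.0, 3.0, 4.0], [1.1, 2.6, 3.0, 3.9])
print(f"MAE = {mae:.3f} m, within 0.50 m = {within:.1%}")
```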