UV-SAM: Adapting Segment Anything Model for Urban Village Identification

Xin Zhang, Yu Liu, Yuming Lin, Qingmin Liao, Yong Li

TL;DR

UV-SAM tackles the challenge of delineating urban village boundaries from satellite imagery. It introduces a generalist-specialist framework that couples the Segment Anything Model (SAM) with a lightweight SegFormer to generate urban-village–specific prompts across four categories, enabling precise boundary segmentation. On Beijing and Xi'an datasets, UV-SAM achieves state-of-the-art IoU and F1 scores, highlighting the importance of prompt design and multi-modal prompt fusion. The results yield insights into the spatial distribution and temporal trends of urban villages, demonstrating the potential of vision foundation models for sustainable cities.

Abstract

Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructure and poor living conditions, and are closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments have depended heavily on field surveys to monitor urban villages, but these surveys are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and frequently updated satellite images, recent studies have developed computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. To accurately identify urban village boundaries from satellite images, we harness the power of vision foundation models and adapt the Segment Anything Model (SAM) to urban village segmentation; we name the resulting model UV-SAM. Specifically, UV-SAM first leverages a small semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines. Identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and shedding light on the use of vision foundation models for sustainable cities. The dataset and code of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM.
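
To make the prompt-generation step more concrete, the following is a minimal sketch of how a bounding box prompt could be derived from a coarse mask produced by a small segmentation model. It is an illustration under stated assumptions, not the authors' implementation: the function name `mask_to_box`, the 0.5 threshold, and the toy mask are all hypothetical, and the box is returned in (x_min, y_min, x_max, y_max) order on the assumption that the downstream model expects that layout.

```python
import numpy as np


def mask_to_box(mask: np.ndarray, threshold: float = 0.5):
    """Return the tight box (x_min, y_min, x_max, y_max) around all pixels
    whose score exceeds `threshold`, or None if no pixel does.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    binary = mask > threshold
    if not binary.any():
        return None  # nothing detected, so there is no box prompt to emit
    rows = np.flatnonzero(binary.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(binary.any(axis=0))  # column indices containing foreground
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])


# Toy coarse mask standing in for the specialist model's output.
coarse = np.zeros((8, 8), dtype=np.float32)
coarse[2:5, 3:7] = 0.9
box = mask_to_box(coarse)
```

In a pipeline of this kind, the coarse mask and the derived box would both be passed on as prompts, so that the foundation model refines the boundary rather than searching the whole image.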

Paper Structure

This paper contains 35 sections, 6 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Examples of urban villages identified from satellite images, with appearance characteristics provided. The red part represents the urban village areas.
  • Figure 2: Illustration of the proposed UV-SAM framework. The snowflake and torch symbols in the figure indicate that the model parameters in that part are frozen and learnable, respectively.
  • Figure 3: Mask quality rating distributions by datasets from our human evaluation study in Beijing and Xi'an, with average scores shown in the legend.
  • Figure 4: Urban village (UV) distribution in Beijing in 2020.
  • Figure 5: Urban village (UV) distribution along Beijing's ring roads in 2020, by area and number.
  • ...and 8 more figures