Mitigating Urban-Rural Disparities in Contrastive Representation Learning with Satellite Imagery

Miao Zhang, Rumi Chunara

TL;DR

This work tackles urban-rural disparities in land-cover segmentation of satellite images by learning fair, dense representations with FairDCL, a multi-level, locality-aware regularization added to contrastive self-supervised pre-training. By modeling spurious features linked to urbanization and suppressing them at multiple feature-map levels via mutual information constraints, the approach reduces subgroup gaps in segmentation accuracy while preserving overall performance. Empirical results on LoveDA and EOLearn Slovenia show a smaller urban-rural gap (Diff) and a higher worst-group result (Wst), and embedding-space analyses and ablations confirm the importance of multi-level de-biasing. The method is compatible with existing SSL frameworks and is relevant to equitable geographic analysis and policy-related tasks.

Abstract

Satellite imagery is being leveraged for many societally critical tasks across climate, economics, and public health. Yet, because of heterogeneity in landscapes (e.g., how a road looks in different places), models can show disparate performance across geographic areas. Given the important potential for disparities in algorithmic systems used in societal contexts, here we consider the risk of urban-rural disparities in the identification of land-cover features. Identification is performed via semantic segmentation (a common computer vision task in which image regions are labelled according to what is being shown), which uses pre-trained image representations generated via contrastive self-supervised learning. We propose fair dense representation with contrastive learning (FairDCL) as a method for de-biasing the multi-level latent space of convolutional neural network models. The method improves feature identification by removing spurious model representations that are disparately distributed across urban and rural areas, and it does so in an unsupervised way during contrastive pre-training. The resulting image representation mitigates downstream urban-rural prediction disparities and outperforms state-of-the-art baselines on real-world satellite images. Embedding space evaluation and ablation studies further demonstrate FairDCL's robustness. As generalizability and robustness in geographic imagery are nascent topics, our work motivates researchers to consider metrics beyond average accuracy in such applications.
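
To make "metrics beyond average accuracy" concrete, the following is a minimal sketch of a per-group evaluation. It assumes that Diff is the absolute gap between the best and worst group IoU and that Wst is the lowest group IoU; the paper's exact definitions may differ (for example, Diff could be normalized). The function names `iou` and `group_report` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def iou(pred, target, cls):
    """Intersection-over-union for one class over integer label maps."""
    p = pred == cls
    t = target == cls
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both prediction and label
    return float(np.logical_and(p, t).sum() / union)

def group_report(preds, targets, groups, cls):
    """Per-group IoU plus gap and worst-group summaries (assumed definitions)."""
    scores = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        p = np.stack([preds[i] for i in idx])
        t = np.stack([targets[i] for i in idx])
        scores[g] = iou(p, t, cls)  # pooled over all pixels of the group
    vals = list(scores.values())
    diff = max(vals) - min(vals)    # assumed meaning of "Diff"
    wst = min(vals)                 # assumed meaning of "Wst"
    return scores, diff, wst

# Toy example: one urban and one rural 2x2 label map, class 1 = "road".
pred_u = np.array([[1, 1], [0, 0]]); tgt_u = np.array([[1, 1], [0, 0]])
pred_r = np.array([[1, 0], [0, 0]]); tgt_r = np.array([[1, 1], [0, 0]])
scores, diff, wst = group_report(
    [pred_u, pred_r], [tgt_u, tgt_r], ["urban", "rural"], cls=1
)
print(scores, diff, wst)
```

In this toy case the average IoU (0.75) hides the fact that the rural map is segmented only half as well as the urban one, which is the kind of gap the two group metrics are meant to expose.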

Paper Structure (23 sections, 6 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: Model segmentation performance on urban and rural images of LoveDA (Wang et al., 2021), measured by intersection-over-union (IoU). Two types of upstream feature encoders are used: (1) a CNN encoder trained on unlabeled satellite images with contrastive self-supervised learning (SSL), and (2) the pre-trained foundation model Segment Anything (SAM; Kirillov et al., 2023). Urban-rural disparities are observed for land-cover classes with both encoders, and the disadvantaged groups are consistent across learning models.
  • Figure 2: Diagram of the defined causal relationships between the representation $X$ learnt with contrastive pre-training, the target task prediction outputs $Y$, and the urban/rural attribute $S$. $X$ contains two parts: $X_{spurious}$, generated from features spuriously correlated with $S$, and $X_{robust}$, generated from independent and unchangeable features. $U$ denotes unmeasured confounders that cause both $S$ and $X_{spurious}$ and thus induce correlations between $S$ and $X_{spurious}$.
  • Figure 3: Examples of segmentation bias for the "road" class due to spurious landscape features; the model segments certain patterns well, such as straight, paved roads (blue circles), but segments variations poorly, such as curved, sandy roads (red circles).
  • Figure 4: Bias accumulation during contrastive pre-training. (A) Sum of mutual information estimates, and (B) the contrastive loss of a ResNet50 model with MoCo-V2 pre-training. The baseline method with no intervention (Baseline) and the variants that regularize only the global feature vector (Global only), only the first two layers of feature maps (First-two only), or only the last two layers of feature maps (Last-two only) all show residual bias compared to the multi-level method proposed as part of FairDCL.
  • Figure 5: Overview of FairDCL. It captures the spurious information $X_{spurious}$ learnt by urban/rural discriminators and applies regularization to image representations at multiple levels. We build one-hot feature maps to encode the urban/rural attribute and estimate mutual information with neural discriminators. Penalty losses $\mathcal{L}_{D_i}$ are computed accordingly and added to the final contrastive pre-training objective.
  • ...and 2 more figures
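
The pipeline described in the Figure 5 caption (one-hot attribute maps, per-level discriminators, penalties added to the contrastive objective) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the Jensen-Shannon-style mutual information estimate, the toy 1x1-convolution critic, the weight `lam`, and all function names are hypothetical choices made here for clarity, and the adversarial training of the discriminators is omitted.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def one_hot_attribute_map(s, height, width, num_groups=2):
    """Tile group labels s (N,) into one-hot maps of shape (N, G, H, W)."""
    eye = np.eye(num_groups)[s]  # (N, G)
    shape = (len(s), num_groups, height, width)
    return np.broadcast_to(eye[:, :, None, None], shape).copy()

def discriminator(feat, attr_map, w):
    """Toy 1x1-conv critic: one score per spatial location, shape (N, H, W)."""
    z = np.concatenate([feat, attr_map], axis=1)  # (N, C + G, H, W)
    return np.einsum("nchw,c->nhw", z, w)

def jsd_mi_estimate(feat, s, w, rng):
    """Jensen-Shannon-style estimate of MI between features and attribute."""
    _, _, h, wd = feat.shape
    joint = discriminator(feat, one_hot_attribute_map(s, h, wd), w)
    s_shuffled = rng.permutation(s)  # break the feature/attribute pairing
    marginal = discriminator(feat, one_hot_attribute_map(s_shuffled, h, wd), w)
    return float(np.mean(-softplus(-joint)) - np.mean(softplus(marginal)))

def total_loss(contrastive_loss, feature_maps, s, critics, lam, rng):
    """Contrastive loss plus a weighted sum of per-level MI penalties."""
    penalties = [jsd_mi_estimate(f, s, w, rng)
                 for f, w in zip(feature_maps, critics)]
    return contrastive_loss + lam * sum(penalties), penalties

# Toy usage: two feature-map levels for a batch of four images.
rng = np.random.default_rng(0)
s = np.array([0, 1, 1, 0])  # 0 = rural, 1 = urban (assumed encoding)
feats = [rng.normal(size=(4, 8, 16, 16)), rng.normal(size=(4, 16, 8, 8))]
critics = [rng.normal(size=8 + 2), rng.normal(size=16 + 2)]
loss, penalties = total_loss(1.0, feats, s, critics, lam=0.5, rng=rng)
print(loss, penalties)
```

In an actual training loop the critics would be trained to tighten the estimate while the encoder is trained to reduce it; the sketch only shows how the per-level terms are formed and combined.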