Contrastive Pretraining for Visual Concept Explanations of Socioeconomic Outcomes

Ivica Obadic, Alex Levering, Lars Pennig, Dario Oliveira, Diego Marcos, Xiaoxiang Zhu

TL;DR

This work tackles the opacity of deep models predicting socioeconomic indicators from satellite imagery by proposing a post-hoc concept-explanation pipeline that orders latent representations by the target outcome using Rank-N-Contrast (RNC) pretraining, followed by a linear regressor and TCAV-based concept testing. The method yields a latent space that is continuously ordered with respect to the outcome, enabling concept explanations that cluster by outcome intervals and revealing urban patterns associated with different socioeconomic levels. On two geographies/tasks (income in France and liveability in the Netherlands), the approach improves predictive performance for income by about $0.10$ in $R^2$ and $0.08$ in Kendall's $\tau$, while providing interpretable insights into which concepts drive different outcome ranges. Crucially, it does not require location-specific concept labels, enabling cross-region applicability and providing urban-planning insights through concept sensitivities, such as the role of vegetation in higher-income or higher-liveability areas.
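
To make the ordering idea concrete, the following is a minimal NumPy sketch of a Rank-N-Contrast-style loss. It follows the generally published formulation of RNC rather than the authors' code: for an anchor, every other sample acts once as the positive, and it is contrasted against all samples whose labels are at least as far from the anchor's label. The function name `rnc_loss`, the negative Euclidean distance as similarity, and the default temperature are assumptions made for illustration.

```python
import numpy as np


def rnc_loss(embeddings, targets, temperature=2.0):
    """Rank-N-Contrast-style loss for one batch (illustrative sketch).

    embeddings: (n, d) array of latent vectors.
    targets:    (n,) array of continuous labels (e.g. income).
    """
    z = np.asarray(embeddings, dtype=float)
    y = np.asarray(targets, dtype=float)
    n = len(y)

    # Assumed similarity: negative Euclidean distance, scaled by temperature.
    sim = -np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1) / temperature
    # Label distance: absolute difference between targets.
    label_dist = np.abs(y[:, None] - y[None, :])

    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Comparison set: samples k != i whose labels are at least as
            # far from y[i] as y[j] is (this set always contains j itself).
            mask = label_dist[i] >= label_dist[i, j]
            mask[i] = False
            s = sim[i, mask]
            # Numerically stable log-sum-exp over the comparison set.
            m = s.max()
            log_denominator = m + np.log(np.exp(s - m).sum())
            total += log_denominator - sim[i, j]
            count += 1
    return total / count
```

In this sketch, embeddings whose distances follow the label ordering receive a lower loss than embeddings in which samples with distant labels sit close together, which is the property the TL;DR describes as a latent space ordered by the outcome.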

Abstract

Predicting socioeconomic indicators from satellite imagery with deep learning has become an increasingly popular research direction. Post-hoc concept-based explanations can be an important step towards broader adoption of these models in policy-making as they enable the interpretation of socioeconomic outcomes based on visual concepts that are intuitive to humans. In this paper, we study the interplay between representation learning using an additional task-specific contrastive loss and post-hoc concept explainability for socioeconomic studies. Our results on two different geographical locations and tasks indicate that the task-specific pretraining imposes a continuous ordering of the latent space embeddings according to the socioeconomic outcomes. This improves the model's interpretability as it enables the latent space of the model to associate concepts encoding typical urban and natural area patterns with continuous intervals of socioeconomic outcomes. Further, we illustrate how analyzing the model's conceptual sensitivity for the intervals of socioeconomic outcomes can yield new insights for urban studies.
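
The association between concepts and instances described here can be illustrated with a small sketch of the instance-to-concept alignment reported in the Figure 4 caption: each instance is assigned the concept with the highest cosine similarity, after the similarities are normalized per concept with the $L_2$ norm. The function name, the use of one direction vector per concept (for example a concept activation vector), and the reading of "per concept" as a normalization across all instances are assumptions, not details confirmed by the paper.

```python
import numpy as np


def assign_closest_concept(embeddings, concept_vectors):
    """Assign each instance to its most similar concept (illustrative sketch).

    embeddings:      (n_instances, d) latent vectors, assumed non-zero.
    concept_vectors: (n_concepts, d) one direction per concept, assumed non-zero.
    Returns the winning concept index per instance and the normalized scores.
    """
    z = np.asarray(embeddings, dtype=float)
    c = np.asarray(concept_vectors, dtype=float)

    # Cosine similarity between every instance and every concept direction.
    z_unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    c_unit = c / np.linalg.norm(c, axis=1, keepdims=True)
    cos = z_unit @ c_unit.T  # shape (n_instances, n_concepts)

    # Assumed per-concept normalization: divide each concept's column by its
    # L2 norm across instances so that concepts are on a comparable scale.
    scores = cos / np.linalg.norm(cos, axis=0, keepdims=True)

    return scores.argmax(axis=1), scores
```

Coloring a t-SNE plot by the returned index would then show whether concepts occupy contiguous regions of the embedding space.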

Paper Structure (23 sections, 5 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Flowchart of the pipeline presented in this research. For both tasks, we first use Rank-N-Contrast to pre-train the feature encoder to produce embeddings that strongly relate to socioeconomic task scores. Second, we freeze the encoder weights and train a linear layer on top of them to regress the task-specific score. Finally, we use TCAV to study the relation of various urban concepts to socioeconomic scores.
  • Figure 2: Household income instance activations in the average pooling layer visualized with t-SNE. Pre-training with Rank-N-Contrast orders the latent space according to the task regression values rather than the visual features of the images. The resulting embedding space aligns with the socioeconomic outcomes and is therefore better suited for interpretability.
  • Figure 3: Concept accuracy in the average pooling layer of the ResNet-50 encoder for the household income (left) and liveability (right) datasets. The contrastive pretraining improves the linear separability of the concepts in the latent space.
  • Figure 4: Instance-to-concept alignment as seen in the average pooling layer of each model. The instances are colored according to the concept with the highest cosine similarity (after normalizing the similarities per concept with the $L_2$ norm). The most similar concept to an instance can be interpreted as the concept that is most closely aligned to the instance's socioeconomic outcome.
  • Figure 5: The TCAV sensitivity of the vegetation concept for the income (left) and liveability (right) datasets. The magnitudes are normalized to the range [-1, 1] by applying min-max normalization separately to the negative and to the positive TCAV values. For income, the sensitivity to vegetation is highest in high-income areas: adding vegetation to these areas increases the neighborhood income perceived by the model. Liveability shows a similar effect: in highly liveable areas, the strongest increase in the liveability perceived by the model comes from increasing the amount of natural vegetation.
  • ...and 6 more figures
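
The sign-separated normalization mentioned in the Figure 5 caption can be sketched as follows. This is one literal reading of the caption, not the authors' code: positive sensitivities are min-max scaled to [0, 1] and negative ones to [-1, 0]. The function name and the handling of edge cases (exact zeros stay at 0, and a group with a single distinct value maps to +1 or -1) are assumptions.

```python
import numpy as np


def normalize_signed(values):
    """Min-max normalize positive and negative values separately (sketch).

    Positive entries are scaled to [0, 1], negative entries to [-1, 0],
    and exact zeros are left at 0.
    """
    v = np.asarray(values, dtype=float)
    out = np.zeros_like(v)

    # (mask, offset added after scaling, value used when the group is constant)
    groups = ((v > 0, 0.0, 1.0), (v < 0, -1.0, -1.0))
    for mask, offset, fallback in groups:
        if not mask.any():
            continue
        part = v[mask]
        span = part.max() - part.min()
        if span > 0:
            out[mask] = (part - part.min()) / span + offset
        else:
            # Assumed convention for a constant group: full magnitude.
            out[mask] = fallback
    return out
```

Under this reading, the smallest positive value and the least negative value both map to 0, so the normalized magnitudes are comparable within each sign but not across signs.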