Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News

Andreas Grivas; Claire Grover; Richard Tobin; Clare Llewellyn; Eleojo Oluwaseun Abubakar; Chunyu Zheng; Chris Dibben; Alan Marshall; Jamie Pearce; Beatrice Alex

TL;DR

This work combines street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. It shows how NLP can unlock further information about neighbourhoods from local news.

Abstract

The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.
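The aggregation step described above (articles are geoparsed, clustered into themes, and the themes are then summarised per neighbourhood) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `theme_distributions`, the dictionary-based inputs, and the toy article ids, areas and theme labels are all assumptions made for illustration, and the geoparsing and clustering steps are taken as already done.

```python
from collections import Counter, defaultdict


def theme_distributions(article_locations, article_themes):
    """Summarise each location as a normalised distribution over themes.

    article_locations: dict mapping article id -> set of locations mentioned
    article_themes: dict mapping article id -> theme label
                    (articles without a theme are simply absent)
    Both inputs are hypothetical stand-ins for geoparser and clustering output.
    """
    counts = defaultdict(Counter)
    for article_id, locations in article_locations.items():
        theme = article_themes.get(article_id)
        if theme is None:
            continue  # skip articles that were not assigned a theme
        for location in locations:
            counts[location][theme] += 1

    # Normalise raw counts so each location's themes sum to 1.
    return {
        location: {theme: n / sum(c.values()) for theme, n in c.items()}
        for location, c in counts.items()
    }


# Toy data, invented for illustration only.
article_locations = {
    "a1": {"area_a"},
    "a2": {"area_a", "area_b"},
    "a3": {"area_a"},
    "a4": {"area_b"},
}
article_themes = {"a1": "crime", "a2": "culture", "a3": "crime", "a4": "culture"}

distributions = theme_distributions(article_locations, article_themes)
```

In this sketch an article that mentions several locations contributes one count to each of them; whether the paper weights such articles differently is not stated in this section.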

Paper Structure (58 sections, 1 equation, 10 figures, 3 tables)

Introduction
Related Work
Capturing Location Characteristics
Information Extraction from News Data
News Data
Collection
Deduplication
Natural Language Processing Methodology
Location
Article Groups
Location and Area Identification
Preprocessing
Named Entity Recognition
Georesolution
Identifying Data Zone Mentions
...and 43 more sections

Figures (10)

Figure 1: Given a dataset of Edinburgh news articles (N=66,601) we a) identify which locations are mentioned in the articles and b) cluster the articles into themes. We then aggregate the cluster information by location and summarise neighbourhoods as a distribution over themes.
Figure 2: The clustering hierarchy contains semantically meaningful groupings of clusters (themes), which we have annotated in green. Each green node corresponds to a cluster id, while red nodes are candidate clusters that were not selected by the clustering algorithm.
Figure 3: Word clouds of the clusters (top 5, ordered left to right) that are most correlated with the SIMD 2020v2 crime rate metric, measured by Spearman's rho, $\rho$.
Figure 4: Characterising two neighbourhoods in terms of their distribution over themes. The distribution over themes is broken down into distributions over topics which can be seen in the inner ring of the chart.
Figure 5: Hierarchy of clusters
...and 5 more figures
