Automating the Identification of High-Value Datasets in Open Government Data Portals

Alfonso Quarati; Anastasija Nikiforova

TL;DR

This work tackles the automated identification of High-Value Datasets (HVD) in Open Government Data (OGD) portals by leveraging actual user interest as reflected in usage statistics. It introduces the High-Value Data index ($HVDi$), a robust metric that fuses median and tail behavior with the share of datasets per category, enabling cross-portal comparisons and region-specific prioritization. The authors construct a Comprehensive Subset of Categories (CSC) from a large US portal corpus and align each portal's categories to this common frame, which makes multi-portal analysis meaningful. The approach is demonstrated on nine US municipal portals powered by Socrata. The methodology reveals category-level HVD patterns, supports open governance decisions, and is extensible via open-source code. The paper also discusses its limitations and outlines directions for integrating ex-ante considerations and improving data quality control.

Abstract

Recognized for fostering innovation and transparency, driving economic growth, enhancing public services, supporting research, empowering citizens, and promoting environmental sustainability, High-Value Datasets (HVD) play a crucial role in the broader Open Government Data (OGD) movement. However, identifying HVD is a resource-intensive and complex challenge due to the nuanced nature of data value. Our proposal aims to automate the identification of HVD on OGD portals using a quantitative approach based on a detailed analysis of user interest derived from data usage statistics, thereby minimizing the need for human intervention. The proposed method involves extracting download data, analyzing metrics to identify high-value categories, and comparing HVD across different portals. This automated process provides valuable insights into trends in dataset usage, reflecting citizens' needs and preferences. The effectiveness of our approach is demonstrated through its application to a sample of US OGD city portals. The practical implications of this study include contributing to the understanding of HVD at both local and national levels. By providing a systematic and efficient means of identifying HVD, our approach aims to inform open governance initiatives and practices, aiding OGD portal managers and public authorities in their efforts to optimize data dissemination and utilization.
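
The per-category step described above (extracting download data, then analyzing metrics per category) can be illustrated with a minimal sketch. It only computes the basic quantities named in the paper outline: number of datasets, share of datasets, total downloads, mean, and median. It does not compute $HVDi$, whose formula is not given here. The function name `category_usage_stats`, the `(category, downloads)` record layout, and the sample category names and counts are assumptions made for illustration, not the authors' code or data.

```python
from collections import defaultdict
from statistics import mean, median


def category_usage_stats(datasets):
    """Aggregate download counts per category.

    `datasets` is an iterable of (category, downloads) pairs. This record
    layout is hypothetical and is not the portals' actual schema.
    """
    by_category = defaultdict(list)
    for category, downloads in datasets:
        by_category[category].append(downloads)

    total_datasets = sum(len(counts) for counts in by_category.values())

    stats = {}
    for category, counts in by_category.items():
        stats[category] = {
            "datasets": len(counts),                 # datasets in the category
            "share": len(counts) / total_datasets,   # fraction of all datasets
            "downloads": sum(counts),                # total downloads
            "mean": mean(counts),                    # sensitive to outliers
            "median": median(counts),                # robust to outliers
        }
    return stats


# Made-up sample records, for illustration only.
sample = [
    ("Transportation", 1200),
    ("Transportation", 30),
    ("Transportation", 45),
    ("Public Safety", 800),
    ("Public Safety", 900),
]
stats = category_usage_stats(sample)
```

In this made-up sample, a single heavily downloaded dataset pulls the "Transportation" mean far above its median, which shows why a usage-based ranking may prefer robust statistics over the raw mean.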

Paper Structure (24 sections, 2 equations, 13 figures, 4 tables)

Introduction
Background
Government and local initiatives
Related works
Material & Methods
Methods
Sample and data collection
Results
RQ1: What levels of interest do users show in OGD portals?
RQ2: How can thematic categorization and impact assessment of OGD be conducted to optimize its value and relevance for regions or countries?
Extracting Thematic Information
Assessing Thematic Impact
Number of downloads
Mean and Median
High-Value Data index
...and 9 more sections

Figures (13)

Figure 1: The overall distributions of Views and Downloads for the 9 US cities. Frequency numbers are grouped into five classes: 0–10, 10–100, 100–1000, 1000–10,000 and >10,000.
Figure 2: NY portal categories, ordered by number of datasets per category
Figure 3: Percentage of not categorized datasets in the portals sample
Figure 4: NY portal categories, ordered by number of downloads
Figure 5: NY portal categories, ordered by average downloads per datasets (Mean)
...and 8 more figures
