Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election

Roberto Mondini, Neema Kotonya, Robert L. Logan, Elizabeth M Olson, Angela Oduor Lungati, Daniel Duke Odongo, Tim Ombasa, Hemank Lamba, Aoife Cahill, Joel R. Tetreault, Alejandro Jaimes

TL;DR

Uchaguzi-2022 introduces a richly annotated dataset of 14,169 citizen reports on the 2022 Kenyan election, linking topics, tags, and geospatial data to real-time events. The authors benchmark automated categorization and geotagging using both fully supervised encoder models and few-shot in-context learning with large language models, demonstrating that few-shot approaches can be competitive with standard fine-tuning, especially for low-resource classes. They also treat geotagging as a two-step process of location extraction followed by geocoding, finding that LLM-based methods outperform traditional NER baselines and that geocoding quality improves with explicit location mentions. The work highlights the dataset’s potential for AI for Social Good applications in election integrity monitoring in Africa, while acknowledging limitations in topic specificity and data scarcity for some tasks.

Abstract

Online reporting platforms have enabled citizens around the world to collectively share their opinions and report in real time on events impacting their local communities. Systematically organizing (e.g., categorizing by attributes) and geotagging large amounts of crowdsourced information is crucial to ensuring that accurate and meaningful insights can be drawn from this data and used by policy makers to bring about positive change. These tasks, however, typically require extensive manual annotation efforts. In this paper we present Uchaguzi-2022, a dataset of 14k categorized and geotagged citizen reports related to the 2022 Kenyan General Election containing mentions of election-related issues such as official misconduct, vote count irregularities, and acts of violence. We use this dataset to investigate whether language models can assist in scalably categorizing and geotagging reports, thus highlighting its potential application in the AI for Social Good space.

Paper Structure

This paper contains 30 sections, 11 figures, 26 tables.

Figures (11)

  • Figure 1: Annotated report example. Each input message (top) is categorized and geolocated (bottom).
  • Figure 2: Distributions of title and message lengths.
  • Figure 3: Reports per capita. Scale is per 100K citizens.
  • Figure 4: Report counts over time (left) and words characterizing different phases of the election (right).
  • Figure 5: Example of disagreement between volunteer and expert annotations.
  • ...and 6 more figures