Mining United Nations General Assembly Debates
Mateusz Grzyb, Mateusz Krzyziński, Bartłomiej Sobieski, Mikołaj Spytek, Bartosz Pieliński, Daniel Dan, Anna Wróblewska
TL;DR
This work addresses the need to systematically analyze UNGA speeches by building a richly annotated UN General Debate Corpus (UNGD) spanning 1946–2023 and enriching it with metadata and geospatial context. It uses BERTopic with multiple pretrained embeddings to extract topics, compares its performance to LDA using topic coherence and diversity metrics, and validates results with an interactive Streamlit visualization interface. The enhanced corpus includes 10,679 speeches and 10 new covariates, with improved metadata accuracy and OCR-driven updates for 2023 content. The study finds that BERTopic variants, particularly those using DistilBERT-based embeddings, are superior in coherence and competitive in diversity, and it delivers a user-friendly tool to support political science research and policy analysis on global diplomacy.
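To make the diversity metric concrete, the following is a minimal sketch of one common definition of topic diversity: the fraction of unique words among the top-k words of all topics. This definition, the function name `topic_diversity`, and the toy topics are assumptions for illustration; the summary above does not state which formulation the study uses.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    Assumed definition: 1.0 means no word is shared between topics,
    values near 0 mean the topics largely repeat the same words.
    """
    # Collect the first top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("need at least one topic with at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics, not taken from the corpus.
toy_topics = [
    ["peace", "security", "council", "conflict"],
    ["climate", "development", "sustainable", "security"],
]

score = topic_diversity(toy_topics, top_k=4)
print(score)  # 7 unique words out of 8 -> 0.875
```

A score computed this way depends on `top_k`, so comparisons between models (for example BERTopic against LDA) are only meaningful when the same `top_k` and a comparable number of topics are used.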
Abstract
This project applies Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. NLP allows large volumes of textual data to be processed and analysed efficiently, supporting the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (an interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and gain a nuanced understanding of global diplomatic discourse.
