CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
Ricardo Campos, Ana Filipa Pacheco, Ana Luísa Fernandes, Inês Cantante, Rute Rebouças, Luís Filipe Cunha, José Miguel Isidro, José Pedro Evans, Miguel Marques, Rodrigo Batista, Evelin Amorim, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano
TL;DR
CitiLink-Minutes addresses the lack of richly annotated municipal governance data with a multilayer, de-identified corpus of 120 European Portuguese municipal meeting minutes. The dataset uses a four-layer SemAF-based annotation framework (Personal Information, Metadata, Subject of Discussion, Voting) with double annotation and curator validation, and is released with an interactive dashboard and a temporal train/validation/test split. In baseline experiments, encoder-based models (notably BERTimbau) outperform generative baselines on metadata extraction, vote labeling, and multi-label topic classification, which suggests that structured extraction suits municipal texts. The work advances IR/NLP for local governance and provides a reproducible, privacy-preserving resource for downstream civic analytics and cross-municipality comparisons.
Abstract
City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limits the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
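
The temporal train/validation/test split that accompanies the dataset can be illustrated with a minimal sketch. This is not the authors' procedure: the `Minute` record, its field names, the 70/15/15 default proportions, and the choice to split globally rather than per municipality are all assumptions made for illustration. The only idea taken from the text is that the partitions are ordered in time, so that models are evaluated on meetings later than those they were trained on.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Minute:
    # Hypothetical record; the real dataset schema may differ.
    doc_id: str
    municipality: str
    meeting_date: date


def temporal_split(minutes, train_frac=0.7, val_frac=0.15):
    """Split documents chronologically: oldest -> train, newest -> test.

    The fractions are illustrative assumptions, not the dataset's ratios.
    """
    ordered = sorted(minutes, key=lambda m: m.meeting_date)
    n_train = round(len(ordered) * train_frac)
    n_val = round(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test
```

Sorting before slicing ensures that no validation or test document predates a training document, which avoids leaking future information into training.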
