Semantic Clustering of Civic Proposals: A Case Study on Brazil's National Participation Platform

Ronivaldo Ferreira, Guilherme da Silva, Carla Rocha, Gustavo Pinto

TL;DR

The paper addresses the challenge of turning the large volume of citizen proposals on Brasil Participativo into actionable policy inputs. It builds a scalable topic-modeling pipeline that combines BERTopic with VCGE seed words and automatic LLM validation. It systematically tunes the embeddings and BERTopic hyperparameters, and shows that semi-supervised guidance significantly improves alignment with the official VCGE categories while maintaining topic diversity. External validation with ARI and NMI confirms substantial gains in semantic alignment, supported by automated labeling and interpretable topic naming. The approach lets government platforms scale the analysis of citizen input with less manual effort, enhancing transparency and traceability in public policy cycles.

Abstract

Promoting participation on digital platforms such as Brasil Participativo has emerged as a top priority for governments worldwide. However, due to the sheer volume of contributions, much of this engagement goes underutilized, as organizing it presents significant challenges: (1) manual classification is infeasible at scale; (2) expert involvement is required; and (3) alignment with official taxonomies is necessary. In this paper, we introduce an approach that combines BERTopic with seed words and automatic validation by large language models. Initial results indicate that the generated topics are coherent and institutionally aligned, with minimal human effort. This methodology enables governments to transform large volumes of citizen input into actionable data for public policy.

Paper Structure

This paper contains 24 sections, 1 equation, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Thematic categorization pipeline with BERTopic.
  • Figure 2: Weighted score by model and number of topics.
  • Figure 3: Comparison between the unsupervised and semi-supervised approaches.
  • Figure 4: External alignment metrics (ARI and NMI) at levels N1 and N2.
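
For readers unfamiliar with the external metrics named in Figure 4, the following is a minimal pure-Python sketch of how the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) compare a set of reference categories with a set of discovered topics. It is an illustration only, not the authors' implementation: the function names, the arithmetic-mean normalization chosen for NMI, and the toy labels are all assumptions made here.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """ARI: pair-counting agreement between two partitions, corrected for chance."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # convention for degenerate input (assumption of this sketch)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI: mutual information divided by the mean of the two entropies."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (rows[t] * cols[p]))
        for (t, p), c in cells.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    denom = (h_true + h_pred) / 2
    return mi / denom if denom > 0 else 1.0


# Toy data (hypothetical): reference categories vs. discovered topic ids.
reference = ["health", "health", "education", "education"]
topics = [0, 0, 1, 1]
ari = adjusted_rand_index(reference, topics)
nmi = normalized_mutual_info(reference, topics)
```

Both scores depend only on how documents are grouped, not on the label names, so topic ids never need to be mapped onto category names before scoring. A partition that matches the reference up to relabeling scores at the top of both scales, while ARI falls to around zero or below for chance-level groupings.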