Text clustering applied to data augmentation in legal contexts
Lucas José Gonçalves Freitas, Thaís Rodrigues, Guilherme Rodrigues, Pamella Edokawa, Ariane Farias
TL;DR
The work tackles label scarcity in legal text classification for the Sustainable Development Goals (SDGs) by introducing a clustering-based data augmentation workflow that combines unlabeled and expert-labeled data. It embeds texts with doc2vec, uses k-means clustering to propagate synthetic labels to unlabeled texts, and trains LSTM classifiers on the augmented data, improving accuracy and sensitivity for most SDGs (gains of up to 17% for SDG 15). The approach expands the effective training set (in some cases fivefold) and reduces the manual labeling burden, as demonstrated on datasets from the Brazilian Supreme Federal Court (STF). The methodology gives courts a practical way to use abundant unlabeled legal data to strengthen NLP-driven decision support, and the authors outline future work on integrating more sophisticated embeddings and clustering techniques.
Abstract
Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we used natural language processing tools to augment datasets curated by experts, which significantly improved the machine learning classification workflow for legal texts. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. A clustering-based data augmentation strategy led to marked improvements in the accuracy and sensitivity of the classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered on clustering prove to be highly effective: they expand the existing knowledge base without labor-intensive manual classification.
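The clustering-based augmentation step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes document embeddings are already available (the paper uses doc2vec; toy 2-D vectors stand in for them here), it swaps in a small pure-Python k-means with a deterministic farthest-first initialization, and it assumes a majority-vote rule with a hypothetical `min_purity` threshold for deciding when a cluster's expert labels may be copied to its unlabeled members. All function and parameter names are illustrative.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means; returns one cluster id per point."""
    # Farthest-first initialization: deterministic, an assumption of this sketch.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]


def propagate_labels(embeddings, labels, k, min_purity=0.8):
    """Fill in None labels with the dominant expert label of each cluster.

    Assumed rule (not taken from the paper): a cluster donates its majority
    label only if that label covers at least `min_purity` of the cluster's
    expert-labeled members.
    """
    clusters = kmeans(embeddings, k)
    augmented = list(labels)
    for c in set(clusters):
        idx = [i for i, a in enumerate(clusters) if a == c]
        known = [labels[i] for i in idx if labels[i] is not None]
        if not known:
            continue  # no expert label in this cluster, nothing to propagate
        top, count = Counter(known).most_common(1)[0]
        if count / len(known) >= min_purity:
            for i in idx:
                if augmented[i] is None:
                    augmented[i] = top  # synthetic label
    return augmented


# Toy stand-ins for document embeddings: two well-separated groups.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.2], [5.2, 4.9],
]
# One expert label per group; None marks an unlabeled text.
labels = ["SDG15", None, None, None, "SDG16", None, None, None]

augmented = propagate_labels(embeddings, labels, k=2)
print(augmented)
# ['SDG15', 'SDG15', 'SDG15', 'SDG15', 'SDG16', 'SDG16', 'SDG16', 'SDG16']
```

In this toy run, two expert labels grow into eight training examples. The purity threshold is one possible safeguard against copying labels out of clusters where the experts disagree; the augmented list would then feed a downstream classifier such as the LSTM models described above.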
