Text clustering applied to data augmentation in legal contexts

Lucas José Gonçalves Freitas, Thaís Rodrigues, Guilherme Rodrigues, Pamella Edokawa, Ariane Farias

TL;DR

The work tackles label scarcity in legal text classification for the Sustainable Development Goals (SDGs) with a clustering-based data augmentation workflow that combines unlabeled and expert-labeled data. Texts are embedded with doc2vec and clustered with k-means; synthetic labels are then propagated to unlabeled texts within a radius of each cluster center, and LSTM classifiers are trained on the augmented data. Accuracy and sensitivity improve for most SDGs, with gains of up to 17% for SDG 15. The approach expands the effective training set, in some cases by a factor of 5, while reducing the manual labeling burden. It is demonstrated on datasets from the Brazilian Supreme Federal Court (STF). The methodology offers courts a practical way to use abundant unlabeled legal data to strengthen NLP-driven decision support, and the authors outline more sophisticated embeddings and clustering techniques as future work.

Abstract

Data analysis and machine learning are of preeminent importance in the legal domain, especially in tasks like clustering and text classification. In this study, we used natural language processing tools to enhance datasets curated by experts. This process significantly improved the classification workflow for legal texts using machine learning techniques. We considered the Sustainable Development Goals (SDGs) data from the United Nations 2030 Agenda as a practical case study. The clustering-based data augmentation strategy led to marked improvements in the accuracy and sensitivity of the classification models. For certain SDGs within the 2030 Agenda, we observed performance gains of over 15%. In some cases, the example base expanded by a factor of 5. When dealing with unclassified legal texts, data augmentation strategies centered on clustering prove to be highly effective. They provide a means to expand the existing knowledge base without labor-intensive manual classification.
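
The clustering and label-propagation steps can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D points stand in for doc2vec embeddings, the k-means routine is a plain Lloyd's loop seeded with the first k points, and the majority-vote rule is an assumed reading of "according to the proportion of labeled processes" in the Figure 2 caption. The function names, the toy data and the radius value are all hypothetical.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm. Assumption: the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def propagate_labels(points, labels, centers, radius):
    """Assign synthetic labels to unlabeled points near a cluster center.

    labels[i] is 0 or 1 for expert-labeled texts and None for unlabeled ones.
    Assumption: the synthetic label is the majority expert label found within
    `radius` of the center; regions with no expert labels are skipped.
    """
    k = len(centers)
    nearest = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    synthetic = {}
    for c in range(k):
        inside = [i for i, p in enumerate(points)
                  if nearest[i] == c and math.dist(p, centers[c]) <= radius]
        votes = Counter(labels[i] for i in inside if labels[i] is not None)
        if not votes:
            continue
        majority = votes.most_common(1)[0][0]
        for i in inside:
            if labels[i] is None:
                synthetic[i] = majority
    return synthetic

# Toy data: two groups of points plus one unlabeled point far from both centers.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9),
          (0.1, -0.1), (4.9, 5.2), (2.0, 2.0)]
labels = [0, 1, 0, 1, None, None, None]

centers = kmeans(points, k=2)
synthetic = propagate_labels(points, labels, centers, radius=1.0)
print(synthetic)  # {4: 0, 5: 1}
```

In this toy run the unlabeled point at (2.0, 2.0) lies outside the radius of both centers and receives no synthetic label, so only points close to a center enlarge the training set.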

Paper Structure (14 sections, 4 figures, 5 tables)

This paper contains 14 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Basic flowchart. Text cleaning, embedding and data organization are performed for all SDGs in batch (solid-line steps), while clustering, label propagation and classification are performed individually for each SDG (dashed-line steps).
  • Figure 2: Clustering strategy for data augmentation. For illustration, the data are divided into 2 clusters, and unlabeled legal processes (gray triangles) within radius R are selected. Crosses mark originally labeled texts (a black cross denotes label 0 and a blue cross denotes label 1). Label propagation follows the proportion of labeled processes in each selected region: synthetic label 0 is assigned to the first cluster (black triangles) and synthetic label 1 to the second cluster (blue triangles).
  • Figure 3: Accuracy for bootstrap samples from original and augmented datasets for SDGs 3, 4, 8, 9 and 10.
  • Figure 4: Accuracy for bootstrap samples from original and augmented datasets for SDGs 11, 15, 16 and 17.
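
The bootstrap comparison in Figures 3 and 4 can be approximated with a short sketch. The exact resampling procedure is not described here, so this version makes a simplifying assumption: it resamples a fixed set of (true label, predicted label) pairs with replacement and recomputes accuracy each time, rather than retraining a classifier on each bootstrap sample. The function name, parameters and toy labels are hypothetical.

```python
import random

def bootstrap_accuracy(y_true, y_pred, n_boot=1000, seed=0):
    """Return n_boot accuracy values, each computed on a resample
    (with replacement) of the evaluation pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        scores.append(correct / n)
    return scores

# Toy labels and predictions standing in for two classifiers on the same test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_original = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]   # 7 of 10 correct
pred_augmented = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # 9 of 10 correct

acc_original = bootstrap_accuracy(y_true, pred_original)
acc_augmented = bootstrap_accuracy(y_true, pred_augmented)
print(sum(acc_original) / len(acc_original),
      sum(acc_augmented) / len(acc_augmented))
```

The two lists of scores can then be compared as distributions, for example with side-by-side box plots per SDG.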