Named entity recognition for Serbian legal documents: Design, methodology and dataset development

Vladimir Kalušev, Branko Brkljač

TL;DR

The paper tackles NER for Serbian legal documents, a low-resource language domain, by fine-tuning a Serbian-specific PTM (BERTić) using an ELECTRA-style objective for token classification. It introduces a novel corpus of 75 appellate court rulings, transliterated to Latin and annotated with BIO across 15 NER labels (8 entity types), and reports high cross-validation performance with a mean $F_1$ of $0.96$ and strong robustness to noisy inputs. The dataset and model are publicly released, showcasing feasibility for domain adaptation in legal NLP for Serbian and similar languages. This work advances practical Serbian legal document processing and provides a template for dataset creation and low-resource fine-tuning in other official-domain languages.
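As a rough illustration of the BIO scheme mentioned above, the sketch below converts token-level entity spans into B-/I-/O tags. The helper `spans_to_bio`, the example sentence, and the label name `COURT` are hypothetical and chosen only for illustration; they are not taken from the paper's label set or annotation tooling.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to BIO tags.

    Each span is (start, end_exclusive, entity_type) over token indices.
    Spans are assumed not to overlap (an assumption of this sketch).
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"  # continuation tokens
    return tags


# Hypothetical example sentence in Latin script; "COURT" is an illustrative label.
tokens = ["Apelacioni", "sud", "u", "Novom", "Sadu", "doneo", "je", "presudu"]
spans = [(0, 5, "COURT")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In a token classification setup, tags of this kind would serve as the per-token targets that the fine-tuned model predicts.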

Abstract

Recent advances in natural language processing (NLP), especially large language models (LLMs) and their numerous applications, have drawn research attention to the design of document processing tools and to improvements in document archiving, search and retrieval. The domain of official, legal documents is especially interesting due to the vast amount of data generated on a daily basis, as well as the sizable community of interested practitioners (lawyers, law offices, administrative workers, state institutions and citizens). Providing efficient ways to automate everyday work involving legal documents is therefore expected to have significant impact in different fields. In this work we present an LLM-based solution for Named Entity Recognition (NER) in legal documents written in the Serbian language. It leverages pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points in textual content. Besides the development of a novel dataset for the Serbian language (based on public court rulings), the presented system design and the applied methodology, the paper also discusses the achieved performance metrics and their implications for objective assessment of the proposed solution. Cross-validation tests on the manually labeled dataset, with a mean $F_1$ score of 0.96, together with additional results on intentionally modified text inputs, confirm the applicability of the proposed system design and the robustness of the developed NER solution.
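For reference, the $F_1$ score quoted above is conventionally the harmonic mean of precision $P$ and recall $R$. The standard definitions are given below, where $TP$, $FP$ and $FN$ denote true positives, false positives and false negatives. The number of folds $K$ and the fold index $k$ are notation introduced here for illustration, and how the scores are averaged across output classes is not stated in this summary.

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R}, \qquad
\overline{F_1} = \frac{1}{K} \sum_{k=1}^{K} F_1^{(k)}
$$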

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: (a) An illustration of an original court ruling in Cyrillic script; (b) annotation process with the BIO scheme; (c) number of appearances of each NE type per cross-validation subset (the random sampling procedure is described in Algorithm 1).
  • Figure 2: (a) Model optimization loss, and (b) mean accuracy over training iteration steps.
  • Figure 3: (a) Precision and (b) recall curves for each of 15 NER output classes (categories) over training iteration steps.
  • Figure 4: $F_1$ measure per each output class vs training iterations.
  • Figure 5: Accuracy assessment matrix
  • ...and 1 more figure