Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament

Arkadiusz Bryłkowski, Jakub Klikowski

TL;DR

The paper addresses the lack of Polish-language datasets for legislative content analysis by constructing three benchmark tasks (PPC, PPO, STP) from official Polish government sources. It evaluates Polish-specific LLMs (e.g., HerBERT, PL-RoBERTa, and PL-GPT2, as well as generative models such as T5) against multilingual baselines using five-fold cross-validation and task-specific metrics. The results show that Polish-adapted models excel at the classification tasks, while summarization remains challenging. Key contributions include the PPC, PPO, and STP datasets, a public repository for data and scripts, and empirical insights into model performance and domain adaptation in the Polish legal domain. The findings highlight both the practical potential of LLMs to automate legislative analysis and the need for careful handling of legal language and standardization, with implications for future benchmarks and tool development in Polish law.
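As a rough illustration of the five-fold cross-validation protocol mentioned above, the sketch below splits a labelled dataset into five folds and averages a score across them. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `k_fold_indices` and `majority_baseline_accuracy`, the majority-class placeholder model, the accuracy metric, and the toy labels are all hypothetical and stand in for the actual models and task-specific metrics.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def majority_baseline_accuracy(labels, k=5, seed=0):
    """Mean accuracy over k folds of a majority-class baseline.

    The baseline is a placeholder (an assumption); a real run would
    fine-tune and evaluate a language model on each fold instead.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        # "Train": pick the most frequent label in the training folds.
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        # "Test": score that constant prediction on the held-out fold.
        correct = sum(labels[j] == majority for j in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Toy labels invented for illustration; not taken from PPC, PPO, or STP.
toy_labels = ["a"] * 14 + ["b"] * 6
print(majority_baseline_accuracy(toy_labels))
```

Each sample appears in exactly one held-out fold, so every model is scored on data it did not see during training, and the reported figure is the mean over the five folds.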

Abstract

Large language models (LLMs) are among the best methods for processing natural language, partly due to their versatility. At the same time, domain-specific LLMs are more practical in real-life applications. This work introduces a novel natural language dataset built from data acquired from official legislative authorities' websites. The study formulates three natural language processing (NLP) tasks to evaluate the effectiveness of LLMs on legislative content analysis within the context of the Polish legal system. Key findings highlight the potential of LLMs in automating and enhancing legislative content analysis while emphasizing specific challenges, such as understanding legal context. The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. The results demonstrate that even publicly accessible data can be put to practical use for legislative content analysis.

Paper Structure

This paper contains 16 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Histogram of label frequencies in the PPC dataset.
  • Figure 2: Word frequency in legislative acts for the PPC task.
  • Figure 3: Word frequency in legislative acts for the PPO task.
  • Figure 4: Word frequency in legislative acts for the STP task.