Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

Gili Goldin, Shuly Wintner

TL;DR

The paper addresses the scarcity of Hebrew NLP tools for parliamentary text by fine-tuning DictaBERT on the Knesset Corpus, producing Knesset-DictaBERT. Training uses masked language modeling (MLM) on 256-token chunks with a 15% masking rate, starts from the DictaBERT checkpoint, and covers the full parliamentary dataset. On a sizeable test set, perplexity drops from 22.87 for the DictaBERT baseline to 6.60, and masked-token prediction accuracy improves to 52.55% (top-1), 63.07% (top-2), and 73.59% (top-5), indicating strong domain adaptation. The model is released on Hugging Face to support analysis of Hebrew parliamentary text and downstream political-text research.
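
The chunking and masking setup described above can be illustrated with a minimal sketch. It is not the authors' training code: the token ids are fake, `MASK_ID` is a hypothetical placeholder for the tokenizer's real `[MASK]` id, and the masking is simplified to replace exactly 15% of positions per chunk (BERT-style pipelines usually sample positions per token and mix in random or unchanged tokens).

```python
import random

CHUNK_SIZE = 256   # tokens per chunk, as stated in the summary
MASK_RATE = 0.15   # masking rate, as stated in the summary
MASK_ID = 4        # hypothetical [MASK] id; the real value comes from the tokenizer
IGNORE = -100      # label value conventionally skipped by the loss (assumption)


def chunk_ids(token_ids, chunk_size=CHUNK_SIZE):
    """Split a flat list of token ids into consecutive chunks."""
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


def mask_chunk(chunk, rng, mask_rate=MASK_RATE, mask_id=MASK_ID):
    """Mask a fixed fraction of positions; return (inputs, labels)."""
    n_mask = max(1, round(len(chunk) * mask_rate))
    positions = set(rng.sample(range(len(chunk)), n_mask))
    # Masked positions show the mask id in the input and keep the
    # original token as the label; all other labels are ignored.
    inputs = [mask_id if i in positions else tok
              for i, tok in enumerate(chunk)]
    labels = [tok if i in positions else IGNORE
              for i, tok in enumerate(chunk)]
    return inputs, labels


rng = random.Random(0)
token_ids = list(range(1000, 1600))   # 600 fake token ids
chunks = chunk_ids(token_ids)
inputs, labels = mask_chunk(chunks[0], rng)
```

In practice a library data collator would normally handle the masking; the hand-rolled version here only makes the 256-token and 15% parameters concrete.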

Abstract

We present Knesset-DictaBERT, a large Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings. The model is based on the DictaBERT architecture and demonstrates significant improvements in understanding parliamentary language according to the MLM task. We provide a detailed evaluation of the model's performance, showing improvements in perplexity and accuracy over the baseline DictaBERT model.
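
The two metrics named above, perplexity and top-k accuracy on the MLM task, can be sketched as follows. This assumes perplexity is the exponential of the mean cross-entropy loss over masked tokens, which is the common convention but is not spelled out in the text here; the function names and toy inputs are illustrative only.

```python
import math


def perplexity(masked_token_losses):
    """exp of the mean cross-entropy (natural log) over masked tokens."""
    return math.exp(sum(masked_token_losses) / len(masked_token_losses))


def top_k_accuracy(ranked_predictions, gold_tokens, k):
    """Share of masked positions whose gold token is among the top k guesses.

    ranked_predictions[i] lists candidate tokens for masked position i,
    ordered from most to least probable.
    """
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_tokens)
               if gold in preds[:k])
    return hits / len(gold_tokens)


# Toy data: four masked positions, three ranked candidates each.
ranked = [["a", "b", "c"],
          ["b", "a", "c"],
          ["c", "b", "a"],
          ["a", "c", "b"]]
gold = ["a", "a", "a", "b"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
top3 = top_k_accuracy(ranked, gold, k=3)
```

Under this definition, a mean masked-token loss of ln(6.60) ≈ 1.887 corresponds to the reported perplexity of 6.60, and ln(22.87) ≈ 3.130 to the baseline's 22.87.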

Paper Structure (6 sections, 1 table)