
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset

Santosh T. Y. S. S, Nina Baumgartner, Matthias Stürmer, Matthias Grabmair, Joel Niklaus

TL;DR

This paper addresses the explainability and fairness gap in Legal Judgement Prediction by focusing on a multilingual Swiss dataset (SJP). It introduces an occlusion-based explainability evaluation and a novel Lower Court Insertion (LCI) framework to quantify the influence of lower-court information on predictions. A fine-grained rationale dataset (Supports vs. Opposes and Lower Court annotations) across German, French, and Italian is curated for 108 cases, and four occlusion test sets plus the LCI test set are released. Key findings show that higher predictive accuracy does not guarantee better explainability, and lower-court signals can bias predictions, underscoring the need for deconfounding and explainability-aligned evaluation in multilingual LJP.

Abstract

Assessing the explainability of Legal Judgement Prediction (LJP) systems is essential for building trustworthy and transparent systems, particularly because these systems may rely on factors that lack legal relevance or involve sensitive attributes. This study examines explainability and fairness in LJP models using Swiss Judgement Prediction (SJP), the only available multilingual LJP dataset. We curate a comprehensive collection of rationales from legal experts that 'support' and 'oppose' the judgement for 108 cases in German, French, and Italian. Using an occlusion-based explainability approach, we evaluate the explainability performance of state-of-the-art monolingual and multilingual BERT-based LJP models, as well as models developed with techniques such as data augmentation and cross-lingual transfer, which have been shown to improve prediction performance. Notably, our findings reveal that improved prediction performance does not necessarily correspond to improved explainability performance, underscoring the importance of evaluating models from an explainability perspective. Additionally, we introduce a novel evaluation framework, Lower Court Insertion (LCI), which allows us to quantify the influence of lower court information on model predictions, exposing current models' biases.
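
The two evaluation ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the exact procedure or metrics, so the function names, the sentence-level granularity, the use of a confidence difference as the score, and the keyword-based `toy_score` model are all assumptions made for illustration. Occlusion removes one annotated rationale sentence at a time and measures how the model's confidence changes; lower court insertion swaps the lower court mentioned in the text and measures the shift in confidence.

```python
from typing import Callable, Dict, Sequence

# Hypothetical model interface: text -> probability that the appeal is approved.
Scorer = Callable[[str], float]


def occlusion_deltas(
    sentences: Sequence[str], rationale_idx: Sequence[int], score: Scorer
) -> Dict[int, float]:
    """Confidence drop caused by removing each annotated rationale sentence."""
    baseline = score(" ".join(sentences))
    deltas = {}
    for i in rationale_idx:
        occluded = " ".join(s for j, s in enumerate(sentences) if j != i)
        # Positive delta: the occluded sentence supported the predicted outcome.
        deltas[i] = baseline - score(occluded)
    return deltas


def lower_court_insertion_deltas(
    text: str, original_court: str, alternatives: Sequence[str], score: Scorer
) -> Dict[str, float]:
    """Confidence shift caused by replacing the lower court name in the text."""
    baseline = score(text)
    return {
        court: score(text.replace(original_court, court)) - baseline
        for court in alternatives
    }


def toy_score(text: str) -> float:
    """Stand-in for a trained classifier (assumption): a keyword heuristic."""
    lowered = text.lower()
    prob = 0.5
    if "violated" in lowered:
        prob += 0.3
    if "cantonal court a" in lowered:
        prob -= 0.1  # simulated bias against one lower court
    return prob


if __name__ == "__main__":
    facts = [
        "The appellant claims the ruling violated procedure.",
        "The hearing took place in March.",
    ]
    print(occlusion_deltas(facts, [0, 1], toy_score))

    case = "Cantonal Court A dismissed the claim that the ruling violated procedure."
    print(
        lower_court_insertion_deltas(
            case, "Cantonal Court A", ["Cantonal Court B"], toy_score
        )
    )
```

In this toy setting, a sentence whose removal barely changes the score is one the model does not rely on, and a non-zero shift after swapping the court name indicates that the model is sensitive to lower court identity rather than to the facts alone.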

Paper Structure (18 sections, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Mean number of tokens annotated per label per annotator in the German subset.
  • Figure 2: Distribution of the number of tokens per label in the final dataset across each language.