From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification

Shanshan Xu, T. Y. S. S Santosh, Oana Ichim, Isabella Risini, Barbara Plank, Matthias Grabmair

TL;DR

The paper tackles the challenge of trustworthy, explainable Case Outcome Classification in legal NLP by introducing RaVE, a token-level rationale dataset derived from two ECtHR experts and accompanied by a two-level disagreement taxonomy with COC-specific subcategories. It systematically analyzes expert disagreements, quantifies factors via proxy variables, and demonstrates that underspecification in allegation information drives low inter-annotator agreement, even as state-of-the-art models struggle to align with expert rationales. The study shows that while article-aware models improve prediction metrics, alignment with legal rationales remains limited, underscoring a gap between model reasoning and expert judgment. These findings advocate for accounting for Human Label Variation in benchmark datasets and for closer collaboration between legal experts and ML researchers to build more trustworthy, explainable COC systems.

Abstract

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset, RaVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RaVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.
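To make the notion of "weak agreement" on token-level rationales concrete, the following is a minimal sketch of Cohen's kappa computed over two annotators' binary token masks. It assumes each expert's rationale is encoded as a 0/1 label per token (1 meaning the token is marked as relevant). The function name `cohen_kappa`, the toy masks, and the choice of Cohen's kappa specifically are illustrative assumptions; the text above only reports that agreement between the two experts is weak.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators' binary token labels (1 = rationale)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of tokens on which both annotators agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's marginal rate of marking a token.
    rate_a = sum(a) / n
    rate_b = sum(b) / n
    p_e = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_e == 1.0:
        # Both annotators used one and the same label throughout; kappa is
        # undefined here, and returning 1.0 is a convention, not a standard.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy masks over an eight-token paragraph (not data from RaVE).
expert_1 = [1, 1, 0, 0, 1, 0, 0, 0]
expert_2 = [1, 0, 0, 0, 1, 1, 0, 0]
kappa = cohen_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the two masks agree on 6 of 8 tokens (75%), yet kappa is only about 0.47, because both annotators mark few tokens and much of the raw overlap is expected by chance. This is why chance-corrected measures are preferred over raw overlap for sparse rationale annotations.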

Paper Structure (32 sections, 5 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Disagreement between the experts' annotations, and misalignment between the model and the experts.
  • Figure 2: Taxonomy of disagreement sources: Macro Categories (Yellow), Fine-grained Categories (Green), COC rationale annotation specific (Pink), Proxy Variables (Blue).
  • Figure 3: Distribution of the IAA Kappa Scores
  • Figure 4: Screenshot of the GLOSS annotation interface
  • Figure 5: Base Model Architectures