LegalSeg: Unlocking the Structure of Indian Legal Judgments Through Rhetorical Role Classification

Shubham Kumar Nigam, Tanmay Dubey, Govind Sharma, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya

TL;DR

LegalSeg introduces the largest annotated corpus for rhetorical role classification in Indian judgments and benchmarks a range of architectures on this task. The study shows that models leveraging document structure and cross-sentence context, such as Hierarchical BiLSTM-CRF and ToInLegalBERT, outperform baselines that classify each sentence in isolation, while open-source LLMs require further domain-specific tuning. Key contributions include dataset creation, multiple methodological variants (InLegalBERT, GNNs, Role-Aware Transformers, and RhetoricLLaMA), and an in-depth error analysis highlighting class imbalance and role confusion as remaining challenges. The work advances legal NLP by providing a robust resource and a comprehensive evaluation framework to drive future research and practical legal analytics.

Abstract

In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we evaluate multiple state-of-the-art models, including Hierarchical BiLSTM-CRF, TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and Role-Aware Transformers, alongside an exploratory RhetoricLLaMA, an instruction-tuned large language model. Our results demonstrate that models incorporating broader context, structural relationships, and sequential sentence information outperform those relying solely on sentence-level features. Additionally, we conducted experiments using surrounding context and predicted or actual labels of neighboring sentences to assess their impact on classification accuracy. Despite these advancements, challenges persist in distinguishing between closely related roles and addressing class imbalance. Our work underscores the potential of advanced techniques for improving legal document understanding and sets a strong foundation for future research in legal NLP.
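
The idea of classifying a sentence together with its surrounding context, and optionally the labels of neighboring sentences, can be illustrated with a small input-construction sketch. This is not the authors' pipeline: the function name `build_context_inputs`, the `window` and `sep` parameters, the `<Label>` tagging format, and the role names in the usage example are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence


def build_context_inputs(
    sentences: Sequence[str],
    neighbor_labels: Optional[Sequence[str]] = None,
    window: int = 1,
    sep: str = " [SEP] ",
) -> List[str]:
    """Pair each sentence with up to `window` neighbors on each side.

    Hypothetical helper: if `neighbor_labels` is given (gold or predicted
    roles), each neighbor is prefixed with its label as "<Label> text".
    """
    if neighbor_labels is not None and len(neighbor_labels) != len(sentences):
        raise ValueError("neighbor_labels must align with sentences")

    def render(j: int) -> str:
        # Tag a neighboring sentence with its role label when labels are supplied.
        if neighbor_labels is None:
            return sentences[j]
        return f"<{neighbor_labels[j]}> {sentences[j]}"

    inputs: List[str] = []
    for i, target in enumerate(sentences):
        left = [render(j) for j in range(max(0, i - window), i)]
        right = [render(j) for j in range(i + 1, min(len(sentences), i + window + 1))]
        # The target sentence never carries a label: its role is what gets predicted.
        inputs.append(sep.join(left + [target] + right))
    return inputs


if __name__ == "__main__":
    # Placeholder sentences and role names, not taken from the dataset.
    sents = ["A filed suit.", "B argued delay.", "Appeal dismissed."]
    roles = ["Facts", "Arguments", "Decision"]
    for line in build_context_inputs(sents, neighbor_labels=roles, window=1):
        print(line)
```

Each resulting string could then be passed to any sentence classifier. Whether the neighbor labels are gold or predicted is decided by the caller, which mirrors the two settings mentioned in the abstract.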

Paper Structure

This paper contains 33 sections, 4 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Example illustrating document segmentation using rhetorical roles. The left side shows an excerpt from a legal document, while the right side demonstrates the segmentation and labeling of sentences.
  • Figure 2: Distribution of Rhetorical Roles within the Dataset.
  • Figure 3: Confusion matrix for rhetorical role classification using Hierarchical BiLSTM-CRF model.
  • Figure 4: Confusion matrix for rhetorical role classification using the Multi-Task Learning (MTL) model.
  • Figure 5: Confusion matrix for rhetorical role classification using GNN.
  • ...and 8 more figures