Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi

Debtanu Datta, Rajdeep Mukherjee, Adrijit Goswami, Saptarshi Ghosh

TL;DR

This work tackles the challenge of summarizing Indian court judgments for English and Hindi readers by injecting legal-domain knowledge into both extractive and generative models. It leverages domain-adapted encoders (InLegalBERT) and continual pre-training on Indian legal corpora, including multilingual pre-training, to improve performance on MILDSum for EN-to-EN and EN-to-HI tasks. The authors demonstrate significant gains across standard relevance and factual-consistency metrics, supported by expert evaluation, and show that domain-specific models can surpass GPT-4 on these tasks. The findings highlight practical potential for accessible legal information and suggest that memory-efficient pre-training strategies can achieve comparable results with reduced resources.

Abstract

Summarizing Indian court judgments is a complex task, not only because of the intricate language and unstructured nature of legal texts, but also because a large section of the Indian population does not understand the complex English in which legal text is written and therefore needs summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text, generating summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated by domain experts, demonstrating the effectiveness of our approaches.

Paper Structure

This paper contains 26 sections, 4 figures, and 8 tables.

Figures (4)

  • Figure 1: Architectures of SummaRuNNer and InLegalSumExt
  • Figure 2: We conduct continual pre-training over subsets of InLegalBERT-PT (for English) and Bail Corpora (for Hindi), then supervised fine-tuning over train and validation splits of MILDSum.
  • Figure 3: Sample English data for the span-corruption denoising pre-training objective.
  • Figure 5: Examples of errors in English and Hindi summaries made by the FT-only (fine-tuning only) models and mitigated in the corresponding PT+FT (continual pre-training followed by fine-tuning) models. The last column explains each error.
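
The span-corruption denoising objective named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the T5-style convention, in which each masked span in the input is replaced by a sentinel token and the target lists the sentinels followed by the tokens they hide. The function name `span_corrupt`, the `<extra_id_N>` sentinel format, the hand-picked spans, and the example sentence are illustrative assumptions, not details taken from the paper; in actual pre-training, spans are typically sampled at random over subword tokens.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Mask each (start, length) span with a sentinel and build the target.

    Hypothetical helper for illustration; sentinel names follow the
    T5 convention (<extra_id_0>, <extra_id_1>, ...), which is an assumption.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(sorted(spans)):
        if length <= 0 or start < cursor or start + length > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the untouched tokens, then stand in a sentinel for the span.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target pairs each sentinel with the tokens it replaced.
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    # A final sentinel marks the end of the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


# Illustrative sentence (not from the paper's corpora), split on whitespace.
tokens = "The appellant filed a petition before the High Court".split()
corrupted, target = span_corrupt(tokens, [(1, 1), (4, 2)])
print(corrupted)  # The <extra_id_0> filed a <extra_id_1> the High Court
print(target)     # <extra_id_0> appellant <extra_id_1> petition before <extra_id_2>
```

During continual pre-training under this kind of objective, the model reads the corrupted input and is trained to generate the target, so it must reconstruct the missing legal phrasing from the surrounding context.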