Unveiling Themes in Judicial Proceedings: A Cross-Country Study Using Topic Modeling on Legal Documents from India and the UK
Krish Didwania, Durga Toshniwal, Amit Agarwal
TL;DR
The paper addresses the practical challenge of annotating large volumes of legal documents across jurisdictions. It applies three topic modeling methods (LDA, NMF, and BERTopic) to Indian and UK Supreme Court corpora, with careful preprocessing, input segmentation, and coherence-based evaluation, plus a timeline analysis for India. Key contributions include the first cross-country comparison of long-form legal texts with temporal dynamics, empirical performance benchmarking across the three models, and insights into jurisdiction-specific topic distributions. The work offers a path toward automated, scalable annotation of legal records and informs policy-relevant understanding of judicial discourse across India and the UK.
Abstract
Legal documents are indispensable to legal practice in every country and serve as the primary source of information on previous cases and the statutes applied in them. With the number of judicial cases increasing, it is crucial to systematically categorize past cases into subgroups that can then inform upcoming cases and practice. Our primary focus in this work was to annotate cases in a collection of lengthy legal documents from India and the UK using three topic modeling algorithms: Latent Dirichlet Allocation, Non-Negative Matrix Factorization, and BERTopic. This step is crucial for comparing the generated labels between the two countries and highlights the differences in the types of cases that arise in each jurisdiction. Furthermore, an analysis of the timeline of cases from India was conducted to discern how the dominant topics evolved over the years.
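
As an illustration of what a coherence-based evaluation of topics can look like, the following minimal sketch computes a UMass-style coherence score from document co-occurrence counts. The choice of the UMass measure, the function name `umass_coherence`, and the toy corpus are assumptions made for illustration only; the text here does not state which coherence measure the authors used.

```python
import math

def umass_coherence(top_words, documents):
    """Average UMass-style coherence of a topic's top words.

    top_words: words ordered from most to least probable in the topic.
    documents: list of tokenized documents (lists of strings).
    Higher (less negative) scores mean the words co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # skip words that never occur in the corpus
            co = doc_freq(top_words[i], top_words[j])
            total += math.log((co + 1) / d_j)  # +1 smoothing avoids log(0)
            pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical toy corpus standing in for segmented judgments.
toy_docs = [
    ["court", "appeal", "land", "tenant"],
    ["court", "appeal", "tax", "income"],
    ["land", "tenant", "rent", "court"],
    ["tax", "income", "assessment", "appeal"],
]

coherent_score = umass_coherence(["land", "tenant", "rent"], toy_docs)
mixed_score = umass_coherence(["land", "income", "rent"], toy_docs)
print(coherent_score, mixed_score)
```

In this sketch, a topic whose top words tend to appear in the same documents scores higher than one that mixes unrelated terms, which is the property that makes such scores usable for comparing models like LDA, NMF, and BERTopic on the same corpus.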
