Discovering Significant Topics from Legal Decisions with Selective Inference
Jerrold Soh
TL;DR
This paper presents a four-step automated pipeline that discovers significant topics in legal decision texts by synthesising features from the texts with topic models (LSA and BERTopic variants) and linking those features to outcomes via penalised regression with post-selection inference. By masking outcome-revealing content and applying selective inference to LASSO-selected topics, the method identifies topics significantly correlated with outcomes and surfaces representative cases and key n-grams for interpretation. Evaluations on UDRP and ECHR datasets show that topic features, particularly from LSA, improve model fit and yield interpretable, doctrine-consistent topics, suggesting that the method generalises across legal domains. The approach offers a transparent, scalable way to extract meaningful legal predictors from unstructured texts, with potential for broader application in legal analytics and doctrinal interpretation.
Abstract
We propose and evaluate an automated pipeline for discovering significant topics from legal decision texts by passing features synthesised with topic models through penalised regressions and post-selection significance tests. The method identifies case topics significantly correlated with outcomes, topic-word distributions which can be manually interpreted to gain insights about significant topics, and case-topic weights which can be used to identify representative cases for each topic. We demonstrate the method on a new dataset of domain name disputes and a canonical dataset of European Court of Human Rights violation cases. Topic models based on latent semantic analysis as well as language model embeddings are evaluated. We show that topics derived by the pipeline are consistent with legal doctrines in both areas and can be useful in other related legal analysis tasks.
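To make the shape of such a pipeline concrete, the following is a minimal sketch on synthetic data, not the paper's implementation. It assumes a toy document-term matrix with two planted themes, a continuous outcome, and a linear LASSO fitted by a hand-written coordinate-descent helper (`lasso_cd`, a hypothetical name). In place of selective inference conditioned on the LASSO selection event, it substitutes plain sample splitting (select topics on one half of the cases, test them on the other half), with p-values from a normal approximation. All sizes, the penalty value and the variable names are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_terms, k = 200, 30, 5

# Synthetic corpus (assumption): theme "a" drives terms 0-4, theme "b" terms 5-9.
a = rng.random(n_docs)
b = rng.random(n_docs)
rates = np.ones((n_docs, n_terms))
rates[:, 0:5] = 5 * a[:, None]
rates[:, 5:10] = 5 * b[:, None]
X = rng.poisson(rates).astype(float)           # document-term counts
y = 2 * a + 0.1 * rng.standard_normal(n_docs)  # outcome depends on theme "a" only

# Topic features via LSA: truncated SVD of the centred document-term matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
weights = U[:, :k] * s[:k]         # case-topic weights (n_docs x k)
loadings = Vt[:k]                  # topic-word loadings (k x n_terms)
Z = weights / weights.std(axis=0)  # put topics on a common scale


def lasso_cd(Z, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Zb||^2 + lam * ||b||_1."""
    n, p = Z.shape
    coef = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += Z[:, j] * coef[j]  # remove topic j's current contribution
            rho = Z[:, j] @ resid / n
            coef[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= Z[:, j] * coef[j]
    return coef


# Topic selection with the LASSO on one half of the cases.
idx = rng.permutation(n_docs)
sel, hold = idx[:100], idx[100:]
coef = lasso_cd(Z[sel] - Z[sel].mean(axis=0), y[sel] - y[sel].mean(), lam=0.1)
selected = np.flatnonzero(coef)

# Significance tests on the held-out half (sample splitting, not selective inference).
A = np.column_stack([np.ones(len(hold)), Z[hold][:, selected]])
beta, *_ = np.linalg.lstsq(A, y[hold], rcond=None)
resid = y[hold] - A @ beta
sigma2 = resid @ resid / (len(hold) - A.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in (beta / se)[1:]]

# Interpretation aids: most representative case and heaviest terms per selected topic.
for j, p in zip(selected, p_values):
    top_case = int(np.argmax(np.abs(weights[:, j])))
    top_terms = np.argsort(-np.abs(loadings[j]))[:3].tolist()
    print(f"topic {j}: p={p:.3g}, representative case={top_case}, top terms={top_terms}")
```

Sample splitting is used here only because it is short to write and gives valid tests after selection. It discards half the data at each stage, which is the cost that selective inference methods are designed to avoid.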
