The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines
Martina Forster, Claudia Schulz, Prudhvi Nokku, Melicaalsadat Mirsafian, Jaykumar Kasundra, Stavroula Skylaki
TL;DR
This paper addresses the problem of choosing effective baselines for legal multi-label classification under long-tailed label distributions and varying data availability. It systematically compares sparse-vector methods, Transformer-based models (including DistilRoBERTa, LegalBERT, and T5), and similarity-based architectures (BiEncoder, CrossEncoder) on two legal datasets, POSTURE50K and EURLEX57K, varying the training size $m$ and the number of labels $k$ to simulate practical data regimes. DistilRoBERTa and LegalBERT typically offer the best trade-off between cost and performance. T5 is competitive and suits settings where the label set evolves, while CrossEncoder achieves notable macro-$F_1$ gains in some cases at a high computational cost. DocTFIDF is a fast, reasonable baseline. The work gives legal NLP practitioners practical guidance for balancing accuracy, speed, and scalability, and points to future work on long-document handling and LLM-based approaches for dynamic labeling.
Abstract
Multi-Label Classification (MLC) is a common task in the legal domain, where more than one label may be assigned to a legal document. Applicable methods range from traditional ML approaches to the latest Transformer-based architectures. In this work, we evaluate different MLC methods on two public legal datasets, POSTURE50K and EURLEX57K. By varying the amount of training data and the number of labels, we explore the comparative advantage of each approach in relation to the dataset properties. Our findings highlight DistilRoBERTa and LegalBERT as performing consistently well in legal MLC with reasonable computational demands. T5 demonstrates comparable performance while offering advantages as a generative model in the presence of changing label sets. Finally, we show that the CrossEncoder exhibits potential for notable macro-$F_1$ improvements, albeit at increased computational cost.
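To illustrate why macro-averaged F1 matters when label distributions are long-tailed, the following minimal sketch contrasts macro- and micro-averaged F1 on a toy multi-label example. It assumes the standard definitions of both averages; the data, the label names, and the helper functions (`label_counts`, `f1_from_counts`, `macro_f1`, `micro_f1`) are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import Dict, List, Set, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def label_counts(gold: List[Set[str]], pred: List[Set[str]],
                 labels: List[str]) -> Dict[str, Counts]:
    """Count TP, FP, and FN separately for each label over all documents."""
    counts: Dict[str, Counts] = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Counts]) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: Dict[str, Counts]) -> float:
    """F1 over pooled counts: frequent labels dominate the score."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


# Toy long-tailed data: every document has "common", only two also have "rare".
gold = [{"common"}] * 8 + [{"common", "rare"}] * 2
# A degenerate classifier that never predicts the rare label.
pred = [{"common"}] * 10

counts = label_counts(gold, pred, ["common", "rare"])
macro = macro_f1(counts)
micro = micro_f1(counts)
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
```

In this toy case the classifier that ignores the rare label still reaches a micro-F1 of about 0.91, while its macro-F1 drops to 0.5, which is why macro-averaged gains are informative about performance on the tail of the label distribution.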
