The Right Model for the Job: An Evaluation of Legal Multi-Label Classification Baselines

Martina Forster; Claudia Schulz; Prudhvi Nokku; Melicaalsadat Mirsafian; Jaykumar Kasundra; Stavroula Skylaki

TL;DR

This paper addresses how to select effective baselines for legal multi-label classification under long-tail label distributions and varying data availability. It systematically compares sparse-vector methods, Transformer-based models (DistilRoBERTa, LegalBERT, and T5), and similarity-based architectures (BiEncoder, CrossEncoder) on two legal datasets, POSTURE50K and EURLEX57K, varying the training set size $m$ and the number of labels $k$ to simulate practical data regimes. DistilRoBERTa and LegalBERT typically offer the best cost-performance trade-off. T5 achieves competitive results and, as a generative model, has advantages when label sets evolve. CrossEncoder achieves notable macro-$F_1$ gains in some cases, but at a high computational cost, while DocTFIDF is a fast, reasonable baseline. The work gives legal NLP practitioners practical guidance on balancing accuracy, speed, and scalability, and highlights directions for future work, including long-document handling and LLM-based approaches for dynamic labeling.
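The idea of varying $m$ and $k$ can be illustrated with a minimal sketch. The summary does not spell out the exact procedure, so everything here is an assumption: the function names `restrict_to_top_k` and `sample_training_set` are hypothetical, labels are ranked by raw frequency, documents left without any label are dropped, and the training subset is drawn uniformly at random.

```python
import random
from collections import Counter


def restrict_to_top_k(docs, k):
    """Keep only the k most frequent labels (assumed ranking criterion).

    docs is a list of (text, labels) pairs. Documents that have no label
    left after filtering are dropped (an assumption, not the paper's rule).
    """
    counts = Counter(label for _, labels in docs for label in labels)
    top = {label for label, _ in counts.most_common(k)}
    kept = []
    for text, labels in docs:
        filtered = [label for label in labels if label in top]
        if filtered:
            kept.append((text, filtered))
    return kept


def sample_training_set(docs, m, seed=0):
    """Draw m documents uniformly at random (assumed sampling scheme)."""
    if m >= len(docs):
        return list(docs)
    return random.Random(seed).sample(docs, m)


# Toy data: label "a" is frequent, "c" and "d" sit in the long tail.
toy_docs = [
    ("d1", ["a", "b"]),
    ("d2", ["a"]),
    ("d3", ["a", "c"]),
    ("d4", ["b"]),
    ("d5", ["d"]),
]
top2_docs = restrict_to_top_k(toy_docs, k=2)
train_docs = sample_training_set(top2_docs, m=2)
```

With `k=2`, only labels `a` and `b` survive, so `d3` loses its tail label `c` and `d5` is dropped entirely. A real setup would also need a fixed test set per value of $k$ so that results across training sizes stay comparable.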

Abstract

Multi-Label Classification (MLC) is a common task in the legal domain, where more than one label may be assigned to a legal document. A wide range of methods can be applied, from traditional ML approaches to the latest Transformer-based architectures. In this work, we perform an evaluation of different MLC methods using two public legal datasets, POSTURE50K and EURLEX57K. By varying the amount of training data and the number of labels, we explore the comparative advantage offered by different approaches in relation to the dataset properties. Our findings highlight DistilRoBERTa and LegalBERT as performing consistently well in legal MLC with reasonable computational demands. T5 also demonstrates comparable performance while offering advantages as a generative model in the presence of changing label sets. Finally, we show that the CrossEncoder exhibits potential for notable macro-F1 score improvements, albeit with increased computational costs.
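Since the results are reported as both micro- and macro-F1, a small sketch may help show why the two can diverge under a long-tail label distribution. This is a plain-Python illustration, not the authors' evaluation code. The helper names are hypothetical, and treating an undefined F1 as 0 is an assumed convention.

```python
def f1(tp, fp, fn):
    """F1 from counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """Compute (micro-F1, macro-F1) for multi-label predictions.

    gold and pred are equally long lists of label sets, one per document.
    """
    counts = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then average with equal weight per label.
    macro = sum(f1(*c) for c in counts) / len(counts)
    return micro, macro


# A classifier that always predicts the frequent label and never the rare one.
gold = [{"common"}] * 8 + [{"rare"}] * 2
pred = [{"common"}] * 10
micro, macro = micro_macro_f1(gold, pred, labels=["common", "rare"])
```

In this toy case micro-F1 is 0.8 while macro-F1 is about 0.44, because the rare label contributes an F1 of 0 that macro averaging weights as heavily as the frequent label. This is why macro-F1 is the more sensitive measure for performance on infrequent labels.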

Paper Structure (20 sections, 4 figures, 1 table)

Introduction
Related Work
Experiments
Dataset Construction
MLC Models
Results
What is the influence of dataset size?
What is the influence of label quantity?
Are domain-specific models better?
What are the best legal MLC baselines?
What is the cost-performance trade-off?
Conclusions and Future Work
Appendix
Datasets
POSTURE50K
...and 5 more sections

Figures (4)

Figure 1: Micro- and macro-F1 scores of multi-label classifiers on POSTURE50K data with top 20 and top 200 labels for different training set sizes.
Figure 2: Micro- and macro-F1 scores of multi-label classifiers on EURLEX57K data with top 20 and top 200 labels for different training set sizes.
Figure 3: Micro- and macro-F1 scores of multi-label classifiers on EURLEX57K data with top 1000 labels for different training set sizes.
Figure 4: Performance of multi-label classifiers on the full POSTURE50K (31,944 data points) and EURLEX57K (45,000 data points) data with varying label quantities.
