Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction
Ivan Leonidovich Litvak, Anton Kostin, Fedor Lashkin, Tatiana Maksiyan, Sergey Lagutin
TL;DR
This work tackles the challenge of evaluating legal text extraction without annotated ground truth by proposing and validating 16 unsupervised metrics on seven semantic blocks extracted from 1,000 Russian judicial decisions. The methodology combines document-based, semantic, structural, pseudo-ground truth, and legal-specific metrics, validated against 7,168 expert reviews, and uses a robust statistical pipeline (including bootstrapping, Lin's CCC, and MAE) to quantify alignment with human judgments. Key findings show that Term Frequency Coherence and Coverage Ratio/Block Completeness achieve the strongest agreement with experts among document-level metrics, while Semantic Entropy also provides robust signal; several metrics exhibit negative correlations (e.g., Legal Term Density), and LLM-based evaluation yields moderate alignment, underscoring the limits of current models for precise legal assessment. Overall, the study demonstrates the practicality of annotation-free evaluation for scalable judicial analytics while highlighting the need for domain-adapted representations and human oversight in high-stakes legal contexts.
Abstract
The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1--5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) best align with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) shows moderate alignment, but its performance, obtained with gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings indicate that unsupervised metrics, including LLM-based approaches, enable scalable screening but, given their moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
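For readers unfamiliar with the three agreement statistics named in the abstract, the following is a minimal sketch of how Pearson $r$, Lin's CCC, MAE, and a percentile bootstrap interval could be computed for one metric against expert ratings. It is not the authors' pipeline: the function names, the rescaling of 1--5 Likert ratings to $[0, 1]$ via $(x - 1)/4$, the number of resamples, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: measures linear association only."""
    return float(np.corrcoef(x, y)[0, 1])


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient.

    Unlike Pearson r, it also penalises shifts in mean and scale,
    so it is never larger in magnitude than r.
    """
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()  # population covariance
    return float(2 * cov / (x.var() + y.var() + (mx - my) ** 2))


def mae(x, y):
    """Mean absolute error between metric scores and expert ratings."""
    return float(np.abs(x - y).mean())


def bootstrap_ci(stat, x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a paired statistic (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    samples = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        samples[b] = stat(x[idx], y[idx])
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


if __name__ == "__main__":
    # Synthetic stand-in data; NOT values from the study.
    rng = np.random.default_rng(42)
    likert = rng.integers(1, 6, size=200)   # expert ratings on 1..5
    expert = (likert - 1) / 4.0             # assumed rescaling to [0, 1]
    metric = np.clip(0.3 + 0.5 * expert + rng.normal(0, 0.15, 200), 0, 1)

    for name, stat in [("Pearson r", pearson_r), ("Lin CCC", lin_ccc), ("MAE", mae)]:
        lo, hi = bootstrap_ci(stat, metric, expert)
        print(f"{name}: {stat(metric, expert):.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Reporting CCC alongside $r$ matters because a metric can track expert ratings linearly while being systematically offset from them; in that case $r$ stays high and CCC drops, which is consistent with the gap between the two values reported above.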
