CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

TL;DR

Experiments show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods, offering a scalable and practical solution for professional stylistic evaluation in legal text generation.

Abstract

Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. To improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation. Code and data for CLASE are available at https://github.com/rexera/CLASE.
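
The hybrid scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two component scores are combined, so everything below is an assumption made for illustration: the logistic mapping of features, the convex combination with weight `alpha`, and the names `feature_score`, `hybrid_score`, `avg_sentence_length`, and `function_word_ratio` are all hypothetical and are not taken from the CLASE code.

```python
from math import exp


def feature_score(features, coefficients, bias=0.0):
    """Objective component (assumed form): logistic of a weighted feature sum.

    `features` and `coefficients` map feature names to floats. In CLASE the
    coefficients are learned from contrastive pairs; here they are made up.
    """
    z = bias + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


def hybrid_score(objective, subjective, alpha=0.5):
    """Assumed combination rule: convex mix of the two scores, both in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * objective + (1.0 - alpha) * subjective


# Hypothetical feature values and coefficients, for illustration only.
features = {"avg_sentence_length": 0.8, "function_word_ratio": -0.3}
coefficients = {"avg_sentence_length": 1.2, "function_word_ratio": 0.7}

objective = feature_score(features, coefficients)
subjective = 0.6  # stand-in for an LLM-as-a-judge score rescaled to [0, 1]
print(hybrid_score(objective, subjective, alpha=0.5))
```

Keeping the two components separate until the final mix means each part of the score can be reported on its own, which is in the spirit of the interpretable score breakdowns the abstract mentions.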

Paper Structure

This paper contains 28 sections, 5 equations, 6 figures, 2 tables, and 2 algorithms.

Figures (6)

  • Figure 1: Comparison of Authentic Legal Writing and LLM-Generated Counterpart.
  • Figure 2: CLASE Overview: (I) contrastive pair synthesis from authentic legal documents; (II) training-free contrastive learning to build positive/negative example pools; (III) hybrid scoring combining objective linguistic features with experience-guided LLM evaluation.
  • Figure 3: Correlation analysis between evaluation methods and human judgments. Points closer to the diagonal line indicate better alignment with human evaluation.
  • Figure 4: Ablation study on training set size and retrieval parameters for subjective scoring component, measured by Kendall's $\tau$.
  • Figure 5: Ablation study on training set size and significant feature count for objective scoring component, measured by Kendall's $\tau$.
  • ...and 1 more figure
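
The ablation figures above report agreement with human judgments as Kendall's $\tau$. As a rough sketch of what that statistic measures, the snippet below computes the tie-aware $\tau_b$ variant from scratch. The listing does not say which variant the paper uses, so the choice of $\tau_b$ is an assumption, and the function name `kendall_tau_b` and the example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (O(n^2) pairs)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied on the first variable
        if dy == 0:
            ties_y += 1  # pair tied on the second variable
        if dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        elif dx * dy < 0:
            discordant += 1  # the lists order the pair in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau is undefined when one variable is constant")
    return (concordant - discordant) / denom


# Made-up scores: human ratings versus an automatic metric on five documents.
human = [1, 2, 3, 4, 5]
metric = [2, 1, 4, 3, 5]
print(kendall_tau_b(human, metric))  # 8 concordant, 2 discordant of 10 pairs -> 0.6
```

A value of 1 means the metric ranks every pair of documents in the same order as the human raters, and -1 means it reverses every pair.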