Evaluation Ethics of LLMs in Legal Domain

Ruizhe Zhang, Haitao Li, Yueyue Wu, Qingyao Ai, Yiqun Liu, Min Zhang, Shaoping Ma

TL;DR

This work argues that large language models require domain-specific ethics evaluation before deployment in law. It introduces a three-dimensional evaluation framework—instruction following, legal knowledge, and robustness—and tests a range of general and legal-specialized LLMs using authentic LeCaRD criminal judgments. Findings show that GPT-4 and Qwen-Chat exhibit strong instruction-following and relatively lower bias, while several models suffer from gender, age, and career biases and reduced resistance to inducements, revealing substantial robustness gaps. The study provides a methodological blueprint and empirical insights to inform safer, more reliable LLM deployment in the legal domain and highlights implications for policy and future research.

Abstract

In recent years, the use of large language models for natural language dialogue has gained momentum, leading to their widespread adoption across various domains. However, their competence in addressing challenges specific to specialized fields such as law remains under scrutiny. The incorporation of legal ethics into these models has been overlooked by researchers. We assert that rigorous ethics evaluation is essential to ensure the effective integration of large language models in legal domains, emphasizing the need to assess both domain-specific proficiency and domain-specific ethics. To address this, we propose a novel evaluation methodology that uses authentic legal cases to evaluate the fundamental language abilities, specialized legal knowledge, and legal robustness of large language models (LLMs). The findings from our comprehensive evaluation contribute significantly to the academic discourse on the suitability and performance of large language models in legal domains.

Paper Structure (19 sections, 1 equation, 9 tables)