TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan, Chunyu Kit

TL;DR

TransLaw tackles the challenge of translating Hong Kong case law by introducing a multi-agent system that decomposes translation into word-level terminology, sentence-level translation, and multi-dimensional review, augmented by a Hong Kong legal glossary and retrieval-augmented generation. The authors construct HKCFA Judgment 97-22, a sentence-aligned bilingual dataset of 344 judgments, to benchmark 13 LLMs and compare against a single-agent baseline. Across automated metrics and human ACS scoring, TransLaw consistently outperforms single-agent approaches, though human experts still surpass it in terminology contextualization and stylistic naturalness. The work demonstrates that collaborative multi-agent approaches can improve legal translation quality at scale and provides a practical, cost-efficient framework that could reshape Hong Kong bilingual legal workflows.

Abstract

Hong Kong case law translation presents significant challenges: manual methods suffer from high costs and inconsistent quality, while both traditional machine translation and approaches relying solely on Large Language Models (LLMs) often fail to ensure legal terminology accuracy, culturally embedded nuances, and strict linguistic structures. To overcome these limitations, this study proposes TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation (RAG), and iterative feedback. Experiments on our newly constructed HKCFA Judgment 97-22 dataset, benchmarking 13 open-source and commercial LLMs, demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models. Human evaluation confirms the framework's effectiveness in terms of legal semantic accuracy, structural coherence, and stylistic fidelity, while noting that it still trails human experts in contextualizing complex terminology and stylistic naturalness.

Paper Structure

This paper contains 33 sections, 7 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: The overall architecture of TransLaw. The framework consists of three collaborative modules: (1) the Translation Command Module ($\mathcal{M}_{\text{Com}}$), where $\mathcal{A}_{\text{Com}}$ coordinates the global workflow; (2) the Translation Execution Module ($\mathcal{M}_{\text{Exec}}$), comprising $\mathcal{A}_{\text{Term}}$ and $\mathcal{A}_{\text{Trans}}$ for legal terminology parsing and core translation; and (3) the Expert Review Module ($\mathcal{M}_{\text{Rev}}$), which integrates $\mathcal{A}_{\text{Align}}$, $\mathcal{A}_{\text{TermR}}$, $\mathcal{A}_{\text{Cita}}$, and $\mathcal{A}_{\text{StyleP}}$ for multi-dimensional quality verification.
  • Figure 2: Human Evaluation Results. Performance of the three systems across three dimensions (top) and different weighting schemes (bottom).
  • Figure 3: Example of the first page of a bilingual HKCFA judgment (Case No. FACC 1/2021), comparing the source text and target translation.
  • Figure 4: Prompt template for the Senior Translation Project Manager. It covers segmentation, context memory, iterative feedback integration, and final result aggregation.
  • Figure 5: Prompt template for the HK Legal Terminologist. It ensures that initial terms are retrieved from the official glossary.
  • ...and 6 more figures
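
The module layout in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name, the two-entry glossary, the citation pattern, and the `max_rounds` limit are assumptions made for illustration, and the "translator" is a toy string substitution standing in for an LLM call. The sketch only shows the control flow the caption implies: a command agent routes a sentence through terminology lookup and translation, then loops on reviewer feedback until all review dimensions pass.

```python
import re

# Hypothetical two-entry glossary; the paper's Hong Kong legal glossary
# database is not reproduced here.
GLOSSARY = {"appellant": "上訴人", "respondent": "答辯人"}

# Assumed pattern for case numbers such as "FACC 1/2021".
CITATION = re.compile(r"[A-Z]{2,5} \d+/\d{4}")


def term_agent(sentence: str) -> dict[str, str]:
    """Stand-in for A_Term: return glossary entries that occur in the sentence."""
    lowered = sentence.lower()
    return {en: zh for en, zh in GLOSSARY.items() if en in lowered}


def trans_agent(sentence: str, terms: dict[str, str], flagged: list[str]) -> str:
    """Stand-in for A_Trans. A real system would call an LLM; this toy version
    only applies glossary terms that a reviewer has already flagged."""
    text = sentence
    for en in flagged:
        text = re.sub(re.escape(en), terms[en], text, flags=re.IGNORECASE)
    return text


# Each reviewer returns a list of issues; an empty list means "pass".
def align_reviewer(source, terms, target):
    return [] if target.strip() else ["empty translation"]


def term_reviewer(source, terms, target):
    return [en for en, zh in terms.items() if zh not in target]


def cita_reviewer(source, terms, target):
    return [c for c in CITATION.findall(source) if c not in target]


def style_reviewer(source, terms, target):
    return []  # placeholder: style polishing is not modelled in this sketch


REVIEWERS = {
    "Align": align_reviewer,
    "TermR": term_reviewer,
    "Cita": cita_reviewer,
    "StyleP": style_reviewer,
}


def command_agent(sentence: str, max_rounds: int = 3) -> tuple[str, int]:
    """Stand-in for A_Com: coordinate execution and review with feedback."""
    terms = term_agent(sentence)
    flagged: list[str] = []
    target = ""
    for round_no in range(1, max_rounds + 1):
        target = trans_agent(sentence, terms, flagged)
        issues = {name: check(sentence, terms, target)
                  for name, check in REVIEWERS.items()}
        if not any(issues.values()):
            return target, round_no
        flagged = issues["TermR"]  # feed terminology issues back to A_Trans
    return target, max_rounds


if __name__ == "__main__":
    result, rounds = command_agent("The appellant relies on FACC 1/2021.")
    print(rounds, result)
```

In this toy run the first draft leaves "appellant" untranslated, the terminology reviewer flags it, and the second round substitutes the glossary rendering while keeping the case number intact. Keeping reviewers as independent functions with a shared signature mirrors the caption's separation of review dimensions and makes it easy to swap in model-backed checks.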