TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan; Chunyu Kit

TL;DR

TransLaw tackles the challenge of translating Hong Kong case law with a multi-agent system that decomposes translation into word-level terminology, sentence-level translation, and multi-dimensional review, augmented by a Hong Kong legal glossary and retrieval-augmented generation. The authors construct HKCFA Judgment 97-22, a sentence-aligned bilingual dataset of 344 judgments, and use it to benchmark 13 LLMs against a single-agent baseline. Across automated metrics and human ACS scoring, TransLaw consistently outperforms single-agent approaches, though human experts still surpass it in terminology contextualization and stylistic naturalness. The work indicates that collaborative multi-agent approaches can improve legal translation quality at scale, and it offers a practical, cost-efficient framework that could reshape Hong Kong bilingual legal workflows.

Abstract

Hong Kong case law translation presents significant challenges: manual methods suffer from high costs and inconsistent quality, while both traditional machine translation and approaches relying solely on Large Language Models (LLMs) often fail to ensure legal terminology accuracy, culturally embedded nuances, and strict linguistic structures. To overcome these limitations, this study proposes TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation (RAG), and iterative feedback. Experiments on our newly constructed HKCFA Judgment 97-22 dataset, benchmarking 13 open-source and commercial LLMs, demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models. Human evaluation confirms the framework's effectiveness in terms of legal semantic accuracy, structural coherence, and stylistic fidelity, while noting that it still trails human experts in contextualizing complex terminology and stylistic naturalness.
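The three-stage decomposition with iterative feedback can be illustrated with a minimal sketch. It is not the authors' implementation: the toy glossary, the function names (`retrieve_terms`, `review`, `translate_with_review`), the substring-based term lookup standing in for RAG, the single-check reviewer, and the English-to-Chinese direction are all assumptions made for illustration, and `stub_translator` stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical toy glossary; the paper's glossary database is not reproduced here.
GLOSSARY: Dict[str, str] = {
    "appellant": "上訴人",
    "respondent": "答辯人",
}


def retrieve_terms(sentence: str, glossary: Dict[str, str]) -> Dict[str, str]:
    """Word-level step: return glossary entries whose source term occurs in the sentence.

    A plain substring match stands in for retrieval (assumption).
    """
    lowered = sentence.lower()
    return {src: tgt for src, tgt in glossary.items() if src in lowered}


@dataclass
class Review:
    ok: bool
    feedback: str


def review(translation: str, terms: Dict[str, str]) -> Review:
    """Review step: only checks that every retrieved target term is present.

    A real reviewer would cover several dimensions; this checks one (assumption).
    """
    missing = [tgt for tgt in terms.values() if tgt not in translation]
    if missing:
        return Review(False, "missing terms: " + ", ".join(missing))
    return Review(True, "")


# A translator takes (sentence, retrieved terms, reviewer feedback) and returns a draft.
Translator = Callable[[str, Dict[str, str], str], str]


def translate_with_review(
    sentence: str,
    translator: Translator,
    glossary: Dict[str, str] = GLOSSARY,
    max_rounds: int = 3,
) -> str:
    """Sentence-level step wrapped in a bounded feedback loop."""
    terms = retrieve_terms(sentence, glossary)
    feedback = ""
    translation = ""
    for _ in range(max_rounds):
        translation = translator(sentence, terms, feedback)
        result = review(translation, terms)
        if result.ok:
            break
        feedback = result.feedback  # pass the reviewer's notes to the next round
    return translation


def stub_translator(sentence: str, terms: Dict[str, str], feedback: str) -> str:
    """Stand-in for an LLM call: ignores terminology until feedback arrives."""
    if not feedback:
        return "draft"
    return "draft " + " ".join(terms.values())
```

Calling `translate_with_review("The appellant appealed.", stub_translator)` runs two rounds here: the first draft fails the terminology check, and the second incorporates the glossary term. Bounding the loop with `max_rounds` is a design choice of this sketch so that a reviewer that never approves cannot stall the pipeline.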
