A Cross-Lingual Statutory Article Retrieval Dataset for Taiwan Legal Studies

Yen-Hsiang Wang, Feng-Dian Su, Tzu-Yu Yeh, Yao-Chung Fan

TL;DR

This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings, together with LLM-based baselines that focus on mitigating translation errors and improving cross-lingual retrieval performance.

Abstract

This paper introduces a cross-lingual statutory article retrieval (SAR) dataset designed to enhance legal information retrieval in multilingual settings. Our dataset features spoken-language-style legal inquiries in English, paired with corresponding Chinese versions and relevant statutes, covering all Taiwanese civil, criminal, and administrative laws. This dataset aims to improve access to legal information for non-native speakers, particularly for foreign nationals in Taiwan. We propose several LLM-based methods as baselines for evaluating retrieval effectiveness, focusing on mitigating translation errors and improving cross-lingual retrieval performance. Our work provides a valuable resource for developing inclusive legal information retrieval systems.

Paper Structure

This paper contains 23 sections, 2 equations, 9 figures, and 4 tables.

Figures (9)

  • Figure 1: An example from our cross-lingual SAR QA dataset. Each instance includes a query and an answer in different languages from the data source.
  • Figure 2: Architecture diagram of our cross-lingual statutory article retrieval. The top branch is sparse retrieval, which translates the query into the language of the corpus and then applies a term-based retrieval method. The middle branch is dense retrieval, which directly uses a multilingual embedding model for retrieval. The bottom branch, LLM-augmented retrieval, leverages large language models for query expansion together with dense retrieval methods and then searches the corpus to retrieve the top-K relevant documents.
  • Figure 3: The distribution of relevant laws in our human-labeled dataset, i.e., the statutory article corresponding to each question. Laws that each account for less than two percent are too numerous to display and are omitted from the chart. For the exact numbers, please refer to our dataset, LawFactsQA-TW.
  • Figure 4: The distribution of relevant laws in our synthetic dataset, which is relatively balanced because data generation was based on referencing the search rankings.
  • Figure 5: Architecture diagram for re-ranking statutory article retrieval results based on GPT-4.
  • ...and 4 more figures