A Korean Legal Judgment Prediction Dataset for Insurance Disputes

Alice Saebom Kwak, Cheonkam Jeong, Ji Weon Lim, Byeongcheol Min

TL;DR

The paper addresses the scarcity of Korean legal judgment prediction (LJP) data by introducing a publicly available dataset for insurance disputes. The dataset contains 473 cases (231K tokens), each structured as facts, claims, and a mediation result encoded as a binary label (0 or 1). The paper evaluates several learning approaches under a low-resource regime and identifies Sentence Transformer Fine-tuning (SetFit) with paraphrase-mpnet-base-v2 as particularly data-efficient: it reaches 70.5% accuracy, close to a larger Korean LJP benchmark despite the much smaller dataset. The data were sourced from the Financial Supervisory Service and the Korea Consumer Agency, then preprocessed and anonymized to protect privacy. Overall, the study shows that sample-efficient methods such as SetFit can deliver competitive LJP performance in languages with limited data, making mediation-outcome prediction for Korean insurance disputes practical.

Abstract

This paper introduces a Korean legal judgment prediction (LJP) dataset for insurance disputes. Successful LJP models for insurance disputes can benefit both insurance companies and their customers: they can save both sides time and money by letting them predict the likely outcome before proceeding to dispute mediation. As is often the case with low-resource languages, the amount of data available for this specific task is limited. To mitigate this issue, we investigate how to achieve good performance despite the limited data. In our experiments, we demonstrate that Sentence Transformer Fine-tuning (SetFit, Tunstall et al., 2022) is a good alternative to standard fine-tuning when training data are limited. Models fine-tuned with the SetFit approach on our data perform similarly to the Korean LJP benchmark models (Hwang et al., 2022) despite the much smaller data size.
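
One reason SetFit suits small datasets is that its first stage fine-tunes a sentence encoder on pairs of labeled examples rather than on single examples, so a handful of labeled texts yields many training pairs. The sketch below illustrates only that pair-construction idea in plain Python. It enumerates every pair, whereas the SetFit library samples them, and the function name `make_contrastive_pairs` and the toy texts are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Return (text_a, text_b, similarity) triples for every unordered pair.

    similarity is 1.0 when both texts share a label, otherwise 0.0.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    examples = list(zip(texts, labels))
    return [
        (text_a, text_b, 1.0 if label_a == label_b else 0.0)
        for (text_a, label_a), (text_b, label_b) in combinations(examples, 2)
    ]


# Toy placeholders (assumed for illustration), not real dispute cases.
texts = ["case A", "case B", "case C", "case D"]
labels = [1, 1, 0, 0]
pairs = make_contrastive_pairs(texts, labels)
# 4 labeled examples yield 6 pairs: 2 same-label and 4 different-label.
```

With n labeled examples there are n(n-1)/2 unordered pairs, so the pair-based stage has considerably more training signal to draw on than the raw example count suggests.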

Paper Structure (8 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: The structure of our dataset. The dataset consists of three parts: facts, claims, and mediation results. Facts and claims serve as the input texts, and mediation results serve as the labels.
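
The three-part structure described in the caption can be sketched as a small record type. This is a minimal illustration, not the dataset's actual schema: the class name `DisputeCase`, its field names, the `to_input_text` helper, the separator, and the placeholder strings are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisputeCase:
    """One insurance-dispute case (hypothetical field names)."""

    facts: str   # input text: what happened
    claims: str  # input text: what the parties assert
    label: int   # mediation result as a binary label, 0 or 1

    def __post_init__(self) -> None:
        # Reject anything other than the two label values the dataset uses.
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label}")

    def to_input_text(self, sep: str = " ") -> str:
        """Join facts and claims into a single classifier input."""
        return f"{self.facts}{sep}{self.claims}"


# Placeholder strings (assumed), not real case text.
case = DisputeCase(facts="<facts text>", claims="<claims text>", label=1)
text, target = case.to_input_text(), case.label
```

Keeping facts and claims as separate fields leaves the choice of how to combine them (simple concatenation here) to the model code.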