A Korean Legal Judgment Prediction Dataset for Insurance Disputes
Alice Saebom Kwak, Cheonkam Jeong, Ji Weon Lim, Byeongcheol Min
TL;DR
The paper addresses the scarcity of Korean legal judgment prediction (LJP) data by introducing a publicly available dataset of insurance disputes. The dataset contains $473$ cases ($231K$ tokens), each structured as facts, claims, and a mediation result with a binary label ($0$ or $1$). The authors evaluate several learning approaches in this low-resource setting and find Sentence Transformer Fine-tuning (SetFit) with paraphrase-mpnet-base-v2 to be particularly data-efficient: it reaches $70.5\%$ accuracy, closely matching a larger Korean LJP benchmark despite the much smaller dataset. The cases are sourced from the Financial Supervisory Service and the Korea Consumer Agency and are preprocessed and anonymized to protect privacy. Overall, the study shows that sample-efficient methods such as SetFit can deliver competitive LJP performance in languages with limited data, enabling practical mediation-outcome prediction for Korean insurance disputes.
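As an illustration of the record structure summarized above, the following is a minimal sketch of how one case might be represented in code. The class name `DisputeCase` and its field names are hypothetical and are not taken from the released dataset; the summary does not say which outcome each label value denotes, so the sketch only checks that the label is binary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisputeCase:
    """One insurance-dispute case (hypothetical schema, for illustration only)."""

    facts: str             # factual background of the dispute
    claims: str            # the claims raised in the dispute
    mediation_result: str  # text of the mediation result
    label: int             # binary label, 0 or 1 (meaning not assumed here)

    def __post_init__(self) -> None:
        # Reject anything that is not one of the two label values.
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label!r}")


# Placeholder strings only; these are not real case contents.
example = DisputeCase(
    facts="<facts text>",
    claims="<claims text>",
    mediation_result="<mediation result text>",
    label=1,
)
```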
Abstract
This paper introduces a Korean legal judgment prediction (LJP) dataset for insurance disputes. Successful LJP models for insurance disputes can benefit both insurance companies and their customers: they save both sides time and money by letting them predict the likely outcome before they proceed to dispute mediation. As is often the case with low-resource languages, the amount of data available for this task is limited. To mitigate this issue, we investigate how to achieve good performance despite the limited data. In our experiment, we demonstrate that Sentence Transformer Fine-tuning (SetFit; Tunstall et al., 2022) is a good alternative to standard fine-tuning when training data are limited. Models fine-tuned with the SetFit approach on our data perform similarly to the Korean LJP benchmark models (Hwang et al., 2022) despite the much smaller data size.
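One reason SetFit suits small datasets is that, as described by Tunstall et al. (2022), it first fine-tunes a sentence encoder contrastively on pairs of labelled examples and only then trains a classification head on the resulting embeddings. The sketch below illustrates the pair-construction idea only: it enumerates every pair, whereas SetFit samples pairs, and the function name `make_contrastive_pairs` and the placeholder texts are hypothetical. It shows how $n$ labelled examples yield up to $n(n-1)/2$ training pairs.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Build (text_a, text_b, target) triples from labelled examples.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    Simplified: enumerates all pairs instead of sampling them.
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        target = 1.0 if labels[i] == labels[j] else 0.0
        pairs.append((texts[i], texts[j], target))
    return pairs


# Four placeholder examples with binary labels (not real case texts).
texts = ["case A", "case B", "case C", "case D"]
labels = [1, 1, 0, 0]

pairs = make_contrastive_pairs(texts, labels)
num_positive = sum(1 for _, _, target in pairs if target == 1.0)

print(len(pairs))    # 6 pairs from 4 examples
print(num_positive)  # 2 same-label pairs: (A, B) and (C, D)
```

Because the number of pairs grows quadratically with the number of labelled examples, even a few hundred cases provide a large pool of contrastive training signal for the encoder.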
