LASTIST: LArge-Scale Target-Independent STance dataset
DongJae Kim, Yaejin Lee, Minsu Park, Eunil Park
TL;DR
LASTIST addresses the lack of large-scale, target-independent stance datasets in Korean by introducing 563,299 labeled sentences collected from party press releases. The authors employ an active-learning data collection pipeline and KoBERT-based contrastive learning to create and benchmark a target-independent stance task, plus a single-target subset. Key findings show that target-independent stance detection is substantially more challenging than single-target detection, as reflected in lower accuracy and AUROC, highlighting the need for specialized modeling and larger multilingual resources. The dataset and code enable further research on cross-linguistic bias, low-resource language NLP, and robust stance analysis in political discourse.
Abstract
Stance detection has emerged as an active area of research in artificial intelligence. However, most existing work centers on target-dependent stance detection, which determines whether a person is in favor of or against a specific target. Furthermore, most benchmark datasets are in English, which makes it difficult to develop models for low-resource languages such as Korean, especially in an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of two Korean political parties, the LASTIST dataset contains 563,299 labeled Korean sentences. We describe in detail how we collected and constructed the dataset, and how we trained state-of-the-art deep learning and stance detection models on it. The LASTIST dataset is designed for various stance detection tasks, including target-independent stance detection and diachronic evolution stance detection. The dataset is available at https://anonymous.4open.science/r/LASTIST-3721/.
