Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties
Eunkyung Choi, Young Jin Suh, Hun Park, Wonseok Hwang
TL;DR
This paper introduces PLAT, a benchmark of 50 Korean precedents to evaluate LLMs on the justifiability of additional tax penalties, a topic that requires deep legal understanding beyond statute application. It demonstrates that vanilla LLMs perform modestly, with best results around $F_1$ = 0.75, and reveals specific reasoning blind spots such as handling conflicts between tax authority opinions and legitimate expectations. The authors show that retrieval augmentation, self-reasoning, and multi-agent collaboration with defined roles can boost performance by up to about 11% in $F_1$, especially on harder cases, though combining all features yields limited extra gain. The dataset, available in Korean and English under CC BY-NC, provides a challenging resource for advancing tax law reasoning in LLMs and highlights the potential of agent-based approaches for complex legal tasks.
Abstract
How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain in general, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect real-world complexities, or unavailable as open source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT is constructed to evaluate LLMs' understanding of tax law, particularly in cases where resolving the issue requires more than just applying related statutes. Our experiments with six LLMs reveal that their baseline capabilities are limited, especially when dealing with conflicting issues that demand a comprehensive understanding. However, we found that this limitation can be mitigated by enabling retrieval, self-reasoning, and discussion among multiple agents with specific role assignments.
