KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding

Bokwang Hwang; Seonkyu Lim; Taewoong Kim; Yongjae Geun; Sunghyun Bang; Sohyun Park; Jihyun Park; Myeonggyu Lee; Jinwoo Lee; Yerin Kim; Jinsun Yoo; Jingyeong Hong; Jina Park; Yongchan Kim; Suhyun Kim; Younggyun Hahm; Yiseul Lee; Yejee Kang; Chanhyuk Yoon; Chansu Lee; Heeyewon Jeong; Jiyeon Lee; Seonhye Gu; Hyebin Kang; Yousang Cho; Hangyeol Yoo; KyungTae Lim

TL;DR

KFinEval-Pilot targets the gap in Korean-finance AI evaluation by introducing a domain-specific benchmark with 1,145 instances across knowledge, reasoning, and toxicity. Built via a semi-automated pipeline that combines GPT-4o-generated prompts with expert validation, it ensures linguistic and regulatory alignment with Korea’s financial landscape. Experimental results reveal notable gaps in reasoning and safety across models, with domain-tuned and proprietary systems offering stronger performance but still highlighting risks in high-stakes finance. The benchmark serves as an early diagnostic tool to drive safer, more reliable financial AI in Korea, with plans to broaden coverage and publish the dataset through Datop for community use.

Abstract

We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.
