Steering LLMs toward Korean Local Speech: Iterative Refinement Framework for Faithful Dialect Translation
Keunhyeung Park, Seunguk Yu, Youngbin Kim
TL;DR
This work addresses the translation of standard Korean into regional dialects, targeting two obstacles: the dialect gap in LLMs and the bias of n-gram metrics, which reward copying the source over genuine dialect translation. It introduces DIA-REFINE, an iterative translation-verification-feedback framework that uses an ensemble of external dialect classifiers to steer outputs toward the Jeolla, Gyeongsang, and Jeju dialects. To address evaluation distortions, the authors propose the Dialect Fidelity Score (DFS) and the Target Dialect Ratio (TDR) as metrics that complement traditional n-gram measures and allow a more faithful assessment of dialect translation. Empirical results show that DIA-REFINE improves dialect fidelity, with the multi-candidate variant delivering the strongest gains, particularly when combined with in-context examples. The framework and metrics support inclusive, goal-directed dialect translation and can be extended to additional languages and dialectal varieties.
Abstract
Standard-to-dialect machine translation remains challenging due to a persistent dialect gap in large language models and evaluation distortions inherent in n-gram metrics, which favor source copying over authentic dialect translation. In this paper, we propose the dialect refinement (DIA-REFINE) framework, which guides LLMs toward faithful target dialect outputs through an iterative loop of translation, verification, and feedback using external dialect classifiers. To address the limitations of n-gram-based metrics, we introduce the dialect fidelity score (DFS) to quantify linguistic shift and the target dialect ratio (TDR) to measure the success of dialect translation. Experiments on Korean dialects across zero-shot and in-context learning baselines demonstrate that DIA-REFINE consistently enhances dialect fidelity. The proposed metrics distinguish between False Success cases, where high n-gram scores obscure failures in dialectal translation, and True Attempt cases, where genuine attempts at dialectal translation yield low n-gram scores. We also observe that models vary in their responsiveness to the framework, and that integrating in-context examples further improves the translation of dialectal expressions. Our work establishes a robust framework for goal-directed, inclusive dialect translation, providing both rigorous evaluation and critical insights into model performance.
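
The translation, verification, and feedback loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `translate` callable, the classifier callables, the majority vote, the feedback wording, and the `max_iters` budget are all hypothetical choices made for illustration. The `target_dialect_ratio` helper likewise assumes that TDR can be read as the fraction of outputs that the classifiers accept as the target dialect; the abstract does not give the exact definition, and DFS is omitted because its formula is not stated here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   translate(source, target_dialect, feedback) -> candidate translation
#   classifier(text) -> predicted dialect label
Translator = Callable[[str, str, Optional[str]], str]
Classifier = Callable[[str], str]


def majority_label(text: str, classifiers: Sequence[Classifier]) -> str:
    """Combine several dialect classifiers by simple majority vote (assumed)."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]


def refine(
    source: str,
    target: str,
    translate: Translator,
    classifiers: Sequence[Classifier],
    max_iters: int = 3,
) -> Tuple[str, bool]:
    """Translate, verify with the classifiers, and retry with feedback.

    Returns the last candidate and whether it was verified as the target dialect.
    """
    feedback: Optional[str] = None
    candidate = source
    for _ in range(max_iters):
        candidate = translate(source, target, feedback)
        predicted = majority_label(candidate, classifiers)
        if predicted == target:
            return candidate, True
        # Illustrative feedback message passed to the next translation attempt.
        feedback = (
            f"The previous output was classified as '{predicted}', "
            f"not '{target}'. Revise it toward the target dialect."
        )
    return candidate, False


def target_dialect_ratio(verified: Sequence[bool]) -> float:
    """Assumed reading of TDR: share of outputs verified as the target dialect."""
    return sum(verified) / len(verified) if verified else 0.0
```

In this sketch the classifiers act only as verifiers: they never edit the text, and the LLM behind `translate` decides how to use the feedback string. A multi-candidate variant would call `translate` several times per iteration and keep any candidate that passes verification.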
