EMCee: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

Hamin Koo, Jaehyung Kim

TL;DR

EMCee tackles the English-centric bias of LLMs by introducing a self-contained prompting framework that explicitly extracts query-relevant synthetic multilingual context from the LLM and merges it with reasoning through an LLM-as-a-Judge. It avoids external retrieval, relying on the model's internal knowledge to address knowledge-intensive multilingual queries that translation-based methods struggle with. Across four multilingual benchmarks spanning 24 languages, EMCee yields substantial improvements, especially in low-resource settings, with an average relative gain of 16.4% overall and 31.7% in low-resource languages. This work demonstrates that explicit knowledge extraction and context-driven merging can robustly enhance multilingual capabilities and suggests a path toward more inclusive, culturally grounded prompt design without external data dependencies.

Abstract

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCee (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCee first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCee consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.
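
The extract-then-merge flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `LLM` callable interface, the function name `emcee_sketch`, the prompt wording, and the fallback to the reasoning answer when the judge's verdict is unclear are all assumptions made for this example. The two candidate responses are also generated sequentially here for simplicity.

```python
from typing import Callable

# Hypothetical interface: a function that takes a prompt and returns a completion.
LLM = Callable[[str], str]


def emcee_sketch(query: str, llm: LLM) -> str:
    """Return the response chosen by the judge for a native-language query."""
    # Extracting: ask the model to write query-relevant context from its own
    # knowledge, then answer with that context (prompt text is illustrative).
    context_answer = llm(
        "Write background knowledge relevant to the question, "
        "then answer it using that background.\n"
        f"Question: {query}"
    )

    # Reasoning: answer from step-by-step reasoning alone, with no added context.
    reasoning_answer = llm(
        "Answer the question by reasoning step by step.\n"
        f"Question: {query}"
    )

    # Merging: a judge call compares both responses and picks one.
    verdict = llm(
        "Two candidate responses to the question are given. "
        "Reply with 'A' or 'B' for the better one.\n"
        f"Question: {query}\n"
        f"A: {context_answer}\n"
        f"B: {reasoning_answer}"
    )

    # Assumption: fall back to the reasoning answer if the verdict is not 'A'.
    if verdict.strip().upper().startswith("A"):
        return context_answer
    return reasoning_answer
```

Any function that maps a prompt string to a completion string can be passed as `llm`, so the same sketch works with a real API client or with a stub for testing.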

Paper Structure

This paper contains 49 sections, 12 figures, 19 tables.

Figures (12)

  • Figure 1: Different multilingual prompting methods. Given a Vietnamese query from the social science category in M3-Exam (Zhang et al., 2023), (a) translating the query into English using an external translator results in an incorrect answer. (b) Even with retrieval-augmented generation (Google Custom Search), the model remains incorrect. (c) However, with the EMCee (ours) prompt, which extracts relevant context from the LLM itself, the LLM successfully produces the correct answer.
  • Figure 2: Overview of EMCee. (a) The LLM receives a non-English (native) query along with an instruction to extract relevant synthetic context, producing a context-enriched response. (b) In parallel, the LLM generates a reasoning-focused response using only its inherent knowledge, without additional context. (c) An LLM-as-a-Judge module then compares the two responses and selects the final answer based on contextual relevance and reasoning adequacy.
  • Figure 3: Input prompt and LLM response example. This figure illustrates the input prompt and the LLM response during the (a) Extracting and (b) Merging processes, respectively.
  • Figure 4: Overall language-wise improvement. Test accuracy of GPT-4o-mini across four different multilingual prompting methods on M3-Exam. More results on other datasets and LLMs are presented in Appendix \ref{supp:analyses}.
  • Figure 5: Qualitative example from M3-Exam. This figure shows: (1) the native Javanese question and options, (2) their translation produced by GPT-4o, (3) incorrect responses produced by Eng-CoT, (4) correct responses obtained through our cultural knowledge extraction, and (5) an EMCee response that integrates and compares the outputs from (3) and (4). More qualitative examples are provided in Appendix \ref{supp:examples}.
  • ...and 7 more figures