Language Specific Knowledge: Do Models Know Better in X than in English?

Ishika Agarwal, Nimet Beyza Bozdag, Dilek Hakkani-Tür

TL;DR

Challenging the latent language alignment hypothesis, the paper defines Language Specific Knowledge (LSK) as knowledge best accessed in an expert language for a given LLM, and proposes LSKExtractor to map topics to expert languages and exploit this mapping during inference. The framework operates in two stages: (i) map training queries into semantic clusters and assign an expert language per cluster based on cross-language chain-of-thought performance, and (ii) at test time, identify the closest cluster for a new query and perform reasoning in the cluster’s expert language. Evaluations across CultureAtlas, BLEnD, and SocialIQa with multiple instruction-tuned models show up to 10% relative improvement and performance competitive with strong baselines; the learned LSK maps are also reported to transfer across models and datasets. The work advocates for inclusive, culturally aware multilingual reasoning and provides open-source tooling to facilitate practical adoption and further research.

Abstract

Often, multilingual language models are trained with the objective to map semantically similar content (in different languages) in the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question answering ability of language models. Our contributions are two-fold. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection -- for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages -- and the goal is to select the optimal language for the query. Second, we introduce simple to strong baselines to test this problem. Additionally, as a first-pass solution to this novel problem, we design LSKExtractor to benchmark the language-specific knowledge present in a language model and then exploit it during inference. To test our framework, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, LSKExtractor achieves up to 10% relative improvement across datasets, and is competitive against strong baselines, while being feasible in real-world settings. Broadly, our research contributes to the open-source development (https://github.com/agarwalishika/LSKExtractor/tree/main) of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.

Paper Structure

This paper contains 29 sections, 18 figures, 2 tables.

Figures (18)

  • Figure 1: In this toy experiment, we prompt Llama-3.1-8B-Instruct with the same question across multiple languages (shown in English here only for illustration; the actual queries were translated into each respective language). The correct answer is tennis, yet the model produces different outputs depending on the query language. This illustrates what we refer to as Language-Specific Knowledge.
  • Figure 2: Overview of LSKExtractor. Our method consists of two main steps. In Step 1, we embed training queries into a shared semantic space and cluster them based on topical similarity. For each cluster, we determine the expert language---i.e., the language that yields the most accurate or contextually appropriate reasoning---by comparing model responses across languages. In Step 2, during test-time inference, we embed the test query into the same space, identify its nearest cluster, and select the corresponding expert language (e.g., Spanish) to guide the model toward producing a more informed and culturally grounded response.
  • Figure 3: The main results of measuring LSK: we show the performance of our various baselines and LSK across the three datasets. This setting is with reasoning, as opposed to the corresponding figure in the appendix on the impact of reasoning.
  • Figure 4: Understanding the impact of the clustering on LSKExtractor with 12, 49, and 96 clusters using the kmeans++ algorithm, and the HDBSCAN method (labeled as "DYN").
  • Figure 5: Distribution of languages selected across clusters (12-means clustering), across datasets.
  • ...and 13 more figures
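
The two steps described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2-D vectors stand in for real query embeddings, `correct_by_lang` is a hypothetical table of precomputed per-language correctness labels, and plain Lloyd's k-means with a naive initialisation replaces the kmeans++ and HDBSCAN variants named in Figure 4. The function names `build_lsk_map` and `select_language` are invented for this example.

```python
from collections import defaultdict
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (toy init)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep empty ones as-is.
        centroids = [
            tuple(sum(dim) / len(groups[i]) for dim in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(k)
        ]
    return centroids


def build_lsk_map(embeddings, correct_by_lang, k):
    """Step 1: cluster training queries, then pick the most accurate language per cluster.

    correct_by_lang[lang][j] is 1 if the model answered training query j
    correctly when reasoning in `lang`, else 0 (assumed to be precomputed).
    """
    centroids = kmeans(embeddings, k)
    members = defaultdict(list)
    for j, emb in enumerate(embeddings):
        members[nearest(emb, centroids)].append(j)
    experts = {}
    for cluster, idxs in members.items():
        # Ties are broken alphabetically because languages are iterated in sorted order.
        experts[cluster] = max(
            sorted(correct_by_lang),
            key=lambda lang: sum(correct_by_lang[lang][j] for j in idxs) / len(idxs),
        )
    return centroids, experts


def select_language(query_embedding, centroids, experts, default="en"):
    """Step 2: route a test query to the expert language of its nearest cluster."""
    return experts.get(nearest(query_embedding, centroids), default)


# Toy data: two well-separated topics, each answered better in a different language.
train_embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
correct_by_lang = {"en": [1, 0, 1, 0], "es": [0, 1, 0, 1]}

centroids, experts = build_lsk_map(train_embeddings, correct_by_lang, k=2)
print(select_language((4.8, 5.3), centroids, experts))
print(select_language((0.1, 0.2), centroids, experts))
```

In this toy setup the first test query falls in the cluster where Spanish scored best and the second in the cluster where English scored best. In a real pipeline the selected language would then be used to translate the query and prompt the model to reason in that language; that step is omitted here.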