LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Pei-Fu Guo, Yun-Da Tsai, Chun-Chia Hsu, Kai-Xin Chen, Ya-An Tsai, Kai-Wei Chang, Nanyun Peng, Mi-Yen Yeh, Shou-De Lin

TL;DR

LiveCLKTBench addresses the problem of reliably evaluating cross-lingual knowledge transfer in multilingual LLMs by constructing a leakage-free benchmark, grounded in real-world events, that is refreshed automatically. The method injects new knowledge by post-training on time-sensitive source-language documents, then tests transfer in multiple languages using document-grounded QA derived from actual events. Key contributions include a four-stage generation pipeline (knowledge collection, QA generation, quality verification, translation), contamination control via temporal filters and model checks, and two metrics (Overall and Transfer) that separate in-language learning from cross-lingual transfer. Empirical results show that transfer is strongly influenced by linguistic distance and direction; larger models improve transfer, but the gains diminish with scale and vary across domains, underscoring the need for targeted multilingual strategies. Overall, the work provides a practical, scalable benchmark that can guide the development of more robust cross-lingual knowledge transfer in future multilingual LLMs.

Abstract

Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.

Paper Structure

This paper contains 31 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Leakage Prevention in LiveCLKTBench. The pipeline prevents data leakage by selecting valid knowledge entities that contain facts unknown to pretrained models. Specifically, it identifies independent, time-sensitive real-world entities, filters them by temporal occurrence, and cross-checks them against model outputs to eliminate any entities already known to pretrained models.
  • Figure 2: LiveCLKTBench Pipeline. The generation process consists of four stages: (1) collecting independent, time-sensitive knowledge entities; (2) generating document-grounded question–answer pairs; (3) verifying data quality using a verifier LLM; and (4) translating verified questions into multiple languages for evaluation.
  • Figure 3: Language-level Transferability. Heatmaps show Transfer Scores for each $(L_{\text{train}}, L_{\text{test}})$ pair across models, sorted by average Overall score. Darker colors indicate stronger transferability.
  • Figure 4: Effect of Model Size. Overall and Transfer Scores across model families of different parameter sizes, shown separately by domain.
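
The leakage-prevention steps described in the Figure 1 caption (filter entities by temporal occurrence, then cross-check them against model outputs) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Entity` fields, the `select_valid_entities` function, the `ask_model` callable, the date cutoff, and the substring-based answer check are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Entity:
    """A candidate knowledge entity (hypothetical structure)."""
    name: str
    occurred_on: date      # when the underlying real-world event took place
    probe_question: str    # a factual question about the entity
    gold_answer: str       # the reference answer to the probe question


def select_valid_entities(
    entities: Iterable[Entity],
    cutoff: date,
    ask_model: Callable[[str], str],
) -> List[Entity]:
    """Keep entities that occurred after `cutoff` and that the model cannot answer.

    `cutoff` stands in for the model's assumed knowledge cutoff, and
    `ask_model` for a call to the pretrained model; both are assumptions.
    """
    valid = []
    for entity in entities:
        # Temporal filter: discard events on or before the assumed cutoff.
        if entity.occurred_on <= cutoff:
            continue
        # Model check: discard entities the model already answers correctly.
        # Case-insensitive substring matching is a simplifying assumption.
        response = ask_model(entity.probe_question)
        if entity.gold_answer.lower() in response.lower():
            continue
        valid.append(entity)
    return valid
```

In this sketch an entity survives only if it passes both checks, so an event that is recent but already answerable by the model is still excluded. A real pipeline would likely need a more robust correctness check than substring matching.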