OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer

TL;DR

This study probes multilingual and cross-lingual memorization in large language models with OWL, a dataset of 31,540 aligned literary passages across ten languages (including English originals, official translations, and six newly translated languages) plus audio. It introduces three probing tasks—direct probing, name cloze, and prefix probing—to quantify memorization and cross-lingual transfer, alongside perturbations, quantization, and audio ablation analyses. The results show that models can recall content across languages, with stronger performance on direct recall than on cloze-style prompts, and that cross-lingual transfer occurs even for unseen translations, though with lower accuracy. Character names emerge as a strong cue for recall, and perturbations or modality changes reduce but do not obliterate memorized knowledge; the authors release dataset and code to spur further research while acknowledging limitations around translation quality, training data opacity, and legal/ethical considerations.

Abstract

Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.
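The abstract describes the probing inputs only in words. As a rough illustration, the name cloze, prefix probing, and word-shuffling inputs could be built from a passage as sketched below. This is a minimal sketch under stated assumptions, not the authors' released code: the function names, the `[MASK]` token, the 50% prefix split, and the example sentence (an illustrative public-domain line, not necessarily taken from OWL) are all hypothetical.

```python
import random
import re
from typing import Tuple

MASK = "[MASK]"  # hypothetical placeholder; the paper's prompts may use a different token


def mask_name(passage: str, name: str, mask: str = MASK) -> str:
    """Name cloze input: replace every whole-word occurrence of a character name."""
    return re.sub(rf"\b{re.escape(name)}\b", mask, passage)


def split_prefix(passage: str, fraction: float = 0.5) -> Tuple[str, str]:
    """Prefix probing input: split on whitespace into (prefix, gold continuation).

    The split fraction is an assumption made for this sketch.
    """
    words = passage.split()
    cut = max(1, int(len(words) * fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def shuffle_words(passage: str, seed: int = 0) -> str:
    """Perturbation: shuffle word order, seeded so the result is reproducible."""
    words = passage.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


passage = "Alice was beginning to get very tired of sitting by her sister on the bank"
masked = mask_name(passage, "Alice")
prefix, continuation = split_prefix(passage)
shuffled = shuffle_words(passage)

print(masked)
print(prefix, "|", continuation)
print(shuffled)
```

Direct probing needs no transformation of the passage itself: the unmodified (or perturbed) text is shown to the model, which is asked for the title and author.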

Paper Structure

This paper contains 55 sections, 27 figures, 18 tables.

Figures (27)

  • Figure 1: Top: OWL collection pipeline: (1) Identify English novels with official Turkish, Spanish, and Vietnamese translations; (2) Tag passages with named characters; (3) Align translations to English originals using the Par3 aligner (Thai et al., 2022); (4) Filter alignments based on length and BLEU scores, followed by manual verification; (5) Translate validated English passages into six new languages without official translations. Bottom: Probing tasks: (1) Direct Probing (DP): identify author/title from a passage; (2) Name Cloze (NC): predict masked names in passages; (3) Prefix Probing (PP): generate continuations from passage prefixes. Prompt texts omitted for clarity (see the direct probing, name cloze, and prefix probing prompt figures). The figure shows outputs from GPT-4o. See the data overview table for an overview of our experiments.
  • Figure 2: Overall performance: GPT-4o consistently outperforms other models in probing tasks, followed by LLaMA 405B. Direct Probing (DP; reported for passages with character names) and Prefix Probing (PP) use unmasked passages, while the Name Cloze Task (NCT) uses masked ones with named characters removed. PP performance is measured with ChrF++. PP performance on unseen languages is not reported as it is unclear what the gold continuation should be.
  • Figure 3: Audio vs. Text accuracy on English passages with a character name. GPT-4o-audio exhibits substantial performance across all tasks and modalities. The overlap line denotes the percentage of passages answered correctly in both modalities.
  • Figure 4: Direct probing: Average accuracy across models for shuffled versus standard text inputs. Accuracy decreases from standard to shuffled inputs across all perturbations and language settings, with non-trivial shuffled accuracy on English and official translations.
  • Figure 5: Name cloze: Unshuffled inputs outperform shuffled inputs across all language settings, with non-trivial accuracy on English and official translations.
  • ...and 22 more figures
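
The overlap line in Figure 3 is described as the percentage of passages answered correctly in both modalities. One way to compute such a number from per-passage correctness flags is sketched below. The function names and the toy values are assumptions for illustration only; they are not results from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Percentage of passages marked correct."""
    return 100.0 * sum(correct) / len(correct)


def overlap_rate(text_correct: Sequence[bool], audio_correct: Sequence[bool]) -> float:
    """Percentage of passages answered correctly in BOTH modalities.

    The two sequences must be aligned: index i refers to the same passage.
    """
    if len(text_correct) != len(audio_correct):
        raise ValueError("inputs must cover the same passages")
    both = [t and a for t, a in zip(text_correct, audio_correct)]
    return accuracy(both)


# Toy correctness flags for four passages (made up for this sketch).
text_correct = [True, True, False, True]
audio_correct = [True, False, False, True]

print(accuracy(text_correct))                     # 75.0
print(accuracy(audio_correct))                    # 50.0
print(overlap_rate(text_correct, audio_correct))  # 50.0
```

The overlap can never exceed the lower of the two per-modality accuracies, so a gap between it and that lower accuracy indicates passages recalled in one modality but not the other.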