Targum -- A Multilingual New Testament Translation Corpus

Maciej Rapacz, Aleksander Smywiński-Pohl

TL;DR

This work introduces targum, a multilingual corpus of 657 New Testament translations, of which 352 are unique, spanning five European languages and built from 12 online libraries and one preexisting corpus. Each translation is manually annotated with a canonical identifier and an edition year (`canonical_id` and `canonical_revision_year`) to support flexible, multilevel analyses of translation history. By prioritizing depth over breadth, targum enables micro-level analyses of translation families and macro-level cross-tradition comparisons, and it ships a complete archive with rich metadata as a research platform. The paper analyzes scale, diachronic distribution, verse-length patterns, and translation-similarity clustering to demonstrate the resource's analytical potential and establish a benchmark for quantitative study in translation history.

Abstract

Many European languages possess rich biblical translation histories, yet existing corpora, in prioritizing linguistic breadth, often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
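To make the idea of user-defined "uniqueness" concrete, the following is a minimal sketch of grouping translation records at two granularities. It assumes each record is a dictionary carrying the `canonical_id` and `canonical_revision_year` fields named in the TL;DR. The `source` field, the helper `group_by`, and all record values are hypothetical and do not reflect the corpus's actual schema or contents.

```python
from collections import defaultdict

# Hypothetical records: identifiers, years, and sources are made up for illustration.
records = [
    {"source": "library_a", "canonical_id": "work_a", "canonical_revision_year": 1900},
    {"source": "library_b", "canonical_id": "work_a", "canonical_revision_year": 1950},
    {"source": "library_c", "canonical_id": "work_a", "canonical_revision_year": 1950},
    {"source": "library_a", "canonical_id": "work_b", "canonical_revision_year": 2000},
]


def group_by(records, key_fields):
    """Group records by the tuple of values found in key_fields."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec)
    return dict(groups)


# Micro level: every (work, revision year) pair counts as its own version.
by_edition = group_by(records, ["canonical_id", "canonical_revision_year"])

# Macro level: all revisions of a work collapse into a single entry.
by_work = group_by(records, ["canonical_id"])

print(len(records), len(by_edition), len(by_work))
```

Under these assumptions, the choice of key fields is the only thing that changes between a study of one translation family and a deduplicated cross-tradition comparison.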

Paper Structure (22 sections, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Diachronic distribution of unique translations for each core language. Each translation is marked with a single dot, regardless of how many exact copies we obtained.
  • Figure 2: Distribution of verse lengths (in characters) across all unique translations for each chapter of the New Testament. The thick blue line represents the interquartile range (IQR), and the dark blue dot marks the median length.
  • Figure 3: Pairwise similarity matrices between translations for each language (cosine similarity for semantic comparison, Levenshtein-based similarity for lexical comparison), ordered by year of publication.
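
The two similarity measures named in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the caption does not specify how Levenshtein distance is turned into a similarity, so the sketch assumes the common normalization `1 - distance / max(len)`, and the embedding vectors passed to `cosine_similarity` are assumed to come from some sentence-embedding model that is not named here. All function names and sample texts are hypothetical.

```python
import math


def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lexical_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_matrix(items, sim):
    """Full pairwise matrix of sim(a, b) over items, in the given order."""
    return [[sim(a, b) for b in items] for a in items]


# Made-up (year, text) pairs, sorted by year to mirror the figure's ordering.
versions = sorted([(1950, "In the beginning was the Word"),
                   (1900, "In the begynnyng was the Worde")])
texts = [text for _, text in versions]
lexical = similarity_matrix(texts, lexical_similarity)
```

Sorting by year before building the matrix means that blocks of high similarity near the diagonal correspond to translations that are close in time.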