Pralekha: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre
TL;DR
This work tackles the challenge of mining parallel document pairs for Indic-language document-level MT, where existing CLDA methods are constrained by limited context windows and a reliance on metadata. It introduces Pralekha, a large-scale, human-verified benchmark spanning 12 languages (11 Indic + English) and two domains, to enable robust evaluation of CLDA methods. The core contribution is the Document Alignment Coefficient (DAC), a fine-grained, chunk-level alignment metric that matches smaller text units and computes similarity as the ratio of aligned chunks to the average chunk count of the pair, $\mathrm{DAC} = \frac{2 \times N_{\text{aligned}}}{N_{\text{src}} + N_{\text{tgt}}}$, where $N_{\text{aligned}}$ is the number of aligned chunks and $N_{\text{src}}$ and $N_{\text{tgt}}$ are the chunk counts of the source and target documents. Intrinsic experiments show that the chunk-based approach is 2–3× faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines; extrinsic evaluations show that document-level MT models trained on DAC-aligned data outperform those trained on data from baseline alignment methods. By releasing Pralekha and the evaluation framework, the authors provide a practical foundation for scalable CLDA research and improved document-level MT for Indic languages.
Abstract
Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Existing methods often rely on metadata such as URLs, which are scarce, or on pooled document representations that fail to capture fine-grained alignment cues. Moreover, the limited context window of sentence embedding models hinders their ability to represent document-level context, while sentence-based alignment introduces a combinatorially large search space, leading to high computational cost. To address these challenges for Indic languages, we introduce Pralekha, a benchmark containing over 3 million aligned document pairs across 11 Indic languages and English, including 1.5 million English-Indic pairs. Furthermore, we propose the Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based methods, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that our chunk-based method is 2–3× faster while maintaining competitive performance, and that DAC achieves substantial gains over pooling-based baselines. Extrinsic evaluation further demonstrates that document-level MT models trained on DAC-aligned pairs consistently outperform those trained on pairs from baseline alignment methods. These results highlight DAC's effectiveness for parallel document mining. The dataset and evaluation framework are publicly available to support further research.
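To make the definition of DAC concrete, the following is a minimal Python sketch of the ratio described above: the number of aligned chunks divided by the average number of chunks in the pair. The abstract does not specify how chunks are matched, so the greedy one-to-one matching, the cosine-similarity threshold of 0.8, and the function names `cosine`, `count_aligned_chunks`, and `dac` are all assumptions made for illustration, not the authors' implementation. Chunk embeddings are assumed to be precomputed vectors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_aligned_chunks(
    src: Sequence[Vector], tgt: Sequence[Vector], threshold: float = 0.8
) -> int:
    """Count one-to-one chunk matches whose similarity reaches the threshold.

    ASSUMPTION: greedy matching by descending similarity and the 0.8
    threshold are illustrative choices, not taken from the paper.
    """
    candidates = sorted(
        ((cosine(s, t), i, j) for i, s in enumerate(src) for j, t in enumerate(tgt)),
        reverse=True,
    )
    used_src, used_tgt = set(), set()
    aligned = 0
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_src or j in used_tgt:
            continue  # each chunk may be matched at most once
        used_src.add(i)
        used_tgt.add(j)
        aligned += 1
    return aligned


def dac(n_aligned: int, n_src: int, n_tgt: int) -> float:
    """DAC = 2 * N_aligned / (N_src + N_tgt); 0.0 for two empty documents."""
    total = n_src + n_tgt
    return 2 * n_aligned / total if total else 0.0


# Toy example: 3 source chunks, 2 target chunks, 2 of which match exactly.
src_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt_chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
n_aligned = count_aligned_chunks(src_chunks, tgt_chunks)
score = dac(n_aligned, len(src_chunks), len(tgt_chunks))
print(n_aligned, score)
```

In this toy example two chunks align, so the score is 2 × 2 / (3 + 2) = 0.8. Because the denominator is the sum of both chunk counts, the score reaches 1 only when every chunk on both sides is matched, which penalizes pairs where one document contains substantial unmatched content.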
