Anveshana: A New Benchmark Dataset for Cross-Lingual Information Retrieval On English Queries and Sanskrit Documents

Manoj Balaji Jagadeeshan; Prince Raj; Pawan Goyal

TL;DR

Anveshana is a new benchmark dataset for cross-lingual information retrieval (CLIR) from English queries to Sanskrit documents. It draws on the Srimadbhagavatam and comprises 3,400 query-document pairs across 334 documents. The study evaluates three CLIR paradigms, Query Translation (QT), Document Translation (DT), and Direct Retrieval (DR), using a mix of translation-based and embedding-based models, with DT often delivering the strongest results. The dataset is publicly available on HuggingFace and supports benchmarking of Sanskrit CLIR, including zero-shot evaluations and retrieval augmentations such as REPLUG. The work highlights the value of translation-aware retrieval for ancient texts and points to future improvements in data scale, knowledge augmentation, and constrained translation strategies that could further improve the accessibility and comprehension of Sanskrit scriptures.

Abstract

The study presents a comprehensive benchmark for retrieving Sanskrit documents using English queries, focusing on the chapters of the Srimadbhagavatam. It employs a tripartite approach: Direct Retrieval (DR), Document Translation (DT), and Query Translation (QT), utilizing shared embedding spaces and advanced translation methods to enhance retrieval systems in a Retrieval-Augmented Generation (RAG) framework. The study fine-tunes state-of-the-art models for Sanskrit's linguistic nuances, evaluating models such as BM25, REPLUG, mDPR, ColBERT, Contriever, and GPT-2. It adapts summarization techniques for Sanskrit documents to improve QA processing. Evaluation shows that DT methods outperform DR and QT in handling the cross-lingual challenges of ancient texts, improving accessibility and understanding. A dataset of 3,400 English-Sanskrit query-document pairs underpins the study, which aims to preserve Sanskrit scriptures and share their philosophical importance widely. The dataset is publicly available at https://huggingface.co/datasets/manojbalaji1/anveshana
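To make the difference between the three paradigms concrete, the following is a minimal Python sketch, not the authors' implementation. The token-overlap scorer stands in for a real retriever such as BM25, the glossary-based translators stand in for real machine translation, and the "shared space" scorer stands in for a multilingual embedding model. All function names, the toy glossary, and the two romanized toy documents are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]
Translator = Callable[[str], str]


def overlap_score(query: str, doc: str) -> float:
    """Toy lexical scorer: number of shared tokens (stand-in for BM25)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return float(sum((q & d).values()))


def rank(query: str, docs: Dict[str, str], score: Scorer) -> List[str]:
    """Return document ids ordered from best to worst score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


def query_translation(query_en: str, docs_sa: Dict[str, str],
                      en_to_sa: Translator, score: Scorer) -> List[str]:
    """QT: translate the English query, then retrieve over Sanskrit documents."""
    return rank(en_to_sa(query_en), docs_sa, score)


def document_translation(query_en: str, docs_sa: Dict[str, str],
                         sa_to_en: Translator, score: Scorer) -> List[str]:
    """DT: translate every document to English, then retrieve monolingually."""
    docs_en = {doc_id: sa_to_en(text) for doc_id, text in docs_sa.items()}
    return rank(query_en, docs_en, score)


def direct_retrieval(query_en: str, docs_sa: Dict[str, str],
                     cross_lingual_score: Scorer) -> List[str]:
    """DR: score the English query against Sanskrit text with no translation step."""
    return rank(query_en, docs_sa, cross_lingual_score)


# Hypothetical toy data: a four-word glossary and two romanized "documents".
GLOSSARY_SA_EN = {"dharma": "duty", "yuddha": "war",
                  "bhakti": "devotion", "yoga": "discipline"}
GLOSSARY_EN_SA = {en: sa for sa, en in GLOSSARY_SA_EN.items()}


def toy_sa_to_en(text: str) -> str:
    """Word-by-word glossary lookup (stand-in for a real Sanskrit-to-English MT system)."""
    return " ".join(GLOSSARY_SA_EN.get(tok, tok) for tok in text.split())


def toy_en_to_sa(text: str) -> str:
    """Word-by-word glossary lookup (stand-in for a real English-to-Sanskrit MT system)."""
    return " ".join(GLOSSARY_EN_SA.get(tok, tok) for tok in text.split())


def toy_shared_space_score(query_en: str, doc_sa: str) -> float:
    """Stand-in for a multilingual encoder: compares both texts in one 'concept' space."""
    return overlap_score(query_en, toy_sa_to_en(doc_sa))


DOCS_SA = {"d1": "dharma yuddha", "d2": "bhakti yoga"}
QUERY_EN = "devotion discipline"

qt_ranking = query_translation(QUERY_EN, DOCS_SA, toy_en_to_sa, overlap_score)
dt_ranking = document_translation(QUERY_EN, DOCS_SA, toy_sa_to_en, overlap_score)
dr_ranking = direct_retrieval(QUERY_EN, DOCS_SA, toy_shared_space_score)
```

In this toy setting all three pipelines agree, because the glossary is lossless. With real translation and embedding models the paradigms differ in where errors enter: QT depends on translating a short query, DT on translating whole documents, and DR on the quality of the shared embedding space.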
