Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari

Harshal Kausadikar, Tanvi Kale, Onkar Susladkar, Sparsh Mittal

TL;DR

Historic Scripts to Modern Vision tackles the transliteration of Modi-script Marathi into Devanagari. It provides MoDeTrans, a dataset of 2,043 real-world Modi-script images, plus SynthMoDe, and introduces MoScNet, a knowledge-distillation-based Vision-Language Model with LoRA-tuned teachers and a lightweight decoder student. The architecture fuses image and language features into a fusion token $Z_C = [Z_I \odot Z_L]$ and is trained with the objective $L = L_{CE}^{S} + L_{L2} + L_{D_{KL}}$. It achieves state-of-the-art BLEU on transliteration and strong OCR performance on ICDAR benchmarks. The dataset covers documents from several historical eras and is carefully preprocessed, which supports the digitization of fragile historical records and transliteration research in low-resource settings.

Abstract

In medieval India, the Marathi language was written using the Modi script. Texts written in Modi script contain extensive knowledge about medieval sciences, medicine, and land records, as well as authentic evidence about Indian history. Around 40 million documents are in poor condition and have not yet been transliterated. Furthermore, only a few experts in this domain can transliterate this script into English or Devanagari. Most past research focuses on individual character recognition, so a system that can transliterate Modi script documents to Devanagari script is needed. We propose the MoDeTrans dataset, comprising 2,043 images of Modi script documents accompanied by their corresponding textual transliterations in Devanagari. We further introduce MoScNet (Modi Script Network), a novel Vision-Language Model (VLM) framework for transliterating Modi script images into Devanagari text. MoScNet leverages Knowledge Distillation, where a student model learns from a teacher model to enhance transliteration performance. The final student model of MoScNet performs better than the teacher model while having 163$\times$ fewer parameters. Our work is the first to perform direct transliteration from the handwritten Modi script to the Devanagari script. MoScNet also shows competitive results on the optical character recognition (OCR) task.
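
To make the distillation objective quoted in the TL;DR more concrete, the following is a minimal sketch in plain Python of how the three terms $L_{CE}^{S}$, $L_{L2}$, and $L_{D_{KL}}$ might be combined for a single output token, together with the fusion $Z_I \odot Z_L$ read as an element-wise product. It is not the authors' implementation. All function names are hypothetical, and the choice of which features the L2 term compares (student versus teacher feature vectors here), the direction of the KL term, and the unweighted sum are assumptions made only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(z_image, z_language):
    # Assumption: the fusion symbol is read as an element-wise product.
    return [a * b for a, b in zip(z_image, z_language)]

def cross_entropy(student_logits, target_index):
    # Student cross-entropy against the ground-truth token.
    return -math.log(softmax(student_logits)[target_index])

def l2_feature_loss(student_feat, teacher_feat):
    # Assumption: mean squared difference between student and teacher features.
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)

def kl_divergence(teacher_logits, student_logits):
    # Assumption: KL(teacher || student) over the output distributions.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, target_index,
                      student_feat, teacher_feat):
    # Unweighted sum of the three terms, mirroring the TL;DR formula.
    return (cross_entropy(student_logits, target_index)
            + l2_feature_loss(student_feat, teacher_feat)
            + kl_divergence(teacher_logits, student_logits))
```

In this sketch, a student that matches the teacher exactly in both features and logits incurs only the cross-entropy term, since the L2 and KL terms are then zero.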

Paper Structure

This paper contains 20 sections, 11 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Modi script characters - 12 vowels, 10 numerals, and 36 consonants.
  • Figure 2: Era-wise distribution of images in the MoDeTrans dataset.
  • Figure 3: Modi script images from different eras with Devanagari transliteration from the MoDeTrans dataset.
  • Figure 4: Steps involved in the creation of the MoDeTrans dataset.
  • Figure 5: MoScNet architecture diagram showing the teacher-student knowledge distillation model.
  • ...and 4 more figures