Chandomitra: Towards Generating Structured Sanskrit Poetry from Natural Language Inputs
Manoj Balaji Jagadeeshan, Samarth Bhatia, Pretam Ray, Harshul Raj Surana, Akhil Rajeev P, Priya Mishra, Annarao Kulkarni, Ganesh Ramakrishnan, Prathosh AP, Pawan Goyal
TL;DR
Chandomitra tackles English-to-Sanskrit poetry generation in the rigid Anuṣṭubh meter by introducing a dedicated dataset and two complementary methods: constrained decoding for strict metrical control, and instruction fine-tuning to teach models the metrical and stylistic constraints. Constrained decoding with NLLB-dist-1.3B achieves near-perfect syntactic compliance (99.86% full Anuṣṭubh), while instruction-fine-tuned models such as Phi-4-14B and Mistral-Nemo-2407-12B offer stronger semantic coherence and poetic quality. Human evaluation reveals a trade-off between metrical rigidity and fluency, with instruction-tuned models performing better on poetic and semantic criteria. Generalization experiments show robust metrical performance on additional Sanskrit meters and even on other languages, indicating potential for broader use in multilingual metrical poetry generation.
Abstract
Text generation with large language models has achieved remarkable performance. Recent work has also shown that these models are capable of creative generation tasks, but predominantly for high-resource languages. This prompts a fundamental question: can these large language models be used for structured poetry generation in a low-resource language such as Sanskrit? We present Chandomitra, a dataset for translating English input into structured Sanskrit poetry that adheres to the Anuṣṭubh meter. We benchmark various open and closed models and examine specialized techniques, such as constrained decoding and instruction fine-tuning, for the proposed task. Our constrained decoding methodology achieves 99.86% syntactic accuracy in generating metrically valid Sanskrit poetry, outperforming GPT-4o (1-shot: 31.24%). Our best-performing instruction-tuned model, on the other hand, achieves better semantic coherence with the English input at the expense of slightly lower syntactic accuracy. Human evaluation further reveals that the instruction fine-tuned model is better able to capture the poetic aspects. Data and code are available.
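
To make "syntactic accuracy" concrete, the following is a minimal sketch of a metrical validity check, not the authors' evaluation code. It assumes the commonly cited traditional Anuṣṭubh rule (four pādas of eight syllables; the fifth syllable light and the sixth heavy in every pāda; the seventh heavy in odd pādas and light in even pādas), which may differ from the exact criterion used in the paper. It also assumes each verse has already been scanned into strings of "L" (laghu, light) and "G" (guru, heavy); the function names are hypothetical.

```python
# Hypothetical helpers, assuming verses are pre-scanned into syllable weights:
# each verse is a list of four strings over {"L", "G"}, one string per pada.

def is_anushtubh(padas):
    """Check a scanned verse against the traditional Anushtubh rule (assumed)."""
    if len(padas) != 4:
        return False
    for i, pada in enumerate(padas):
        # Every pada must have exactly eight syllables, each marked L or G.
        if len(pada) != 8 or set(pada) - {"L", "G"}:
            return False
        # Fifth syllable light, sixth syllable heavy, in every pada.
        if pada[4] != "L" or pada[5] != "G":
            return False
        # Seventh syllable: heavy in padas 1 and 3, light in padas 2 and 4.
        expected_seventh = "G" if i % 2 == 0 else "L"
        if pada[6] != expected_seventh:
            return False
    return True


def syntactic_accuracy(verses):
    """Percentage of verses that satisfy the metrical check."""
    if not verses:
        return 0.0
    valid = sum(1 for verse in verses if is_anushtubh(verse))
    return 100.0 * valid / len(verses)
```

A checker of this kind only covers the metrical side; semantic coherence with the English input and poetic quality need separate evaluation, which is why the abstract reports them independently.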
