"Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG

Sergio Torres Aguilar

TL;DR

This work tackles Latin-English machine translation (MT), a morphologically rich and low-resource setting, by proposing a reproducible two-stage pipeline that combines a domain-specialized NLLB-1.3B drafter with a zero-shot, retrieval-augmented LLM refiner. The method uses $k=5$ exemplars retrieved from a 50M-token Latin corpus to guide refinement. It is evaluated on the in-domain Rosenthal test set and on a challenging out-of-domain (OOD) Yves de Chartres dataset, and shows that open-source RAG systems can reach performance statistically comparable to GPT-5 without task-specific LLM fine-tuning. Key contributions include a detailed training protocol with LoRA-based domain-adaptive pretraining (DAPT), an explicit inference algorithm, qualitative analyses, and a public release of code, models, and the Chartres dataset. The approach offers a scalable, transparent path to high-fidelity scholarly translation, balancing semantic adequacy with structural fidelity and allowing translation style to be controlled in ways that closed models do not permit.

Abstract

Translating a morphologically rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved in-context examples (retrieval-augmented generation, RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, the evaluation scripts, and the models to facilitate replicability and further research.

Paper Structure

This paper contains 22 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Overview of our two-stage RAG pipeline. A specialized NLLB model first generates a high-quality draft. This draft is then refined by a zero-shot LLM, which is augmented with in-context examples retrieved from a non-parallel corpus via semantic search.
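
The two-stage flow in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the `drafter` and `refiner` callables stand in for the fine-tuned NLLB model and the zero-shot LLM, the bag-of-words `embed` function is a toy substitute for a real semantic encoder, retrieval is keyed on the Latin source (the summary does not say whether the source or the draft is used as the query), and the prompt wording in `build_prompt` is hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: List[str], k: int = 5) -> List[str]:
    """Return the k corpus passages most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(source: str, draft: str, exemplars: List[str]) -> str:
    """Assemble a refinement prompt (the wording here is hypothetical)."""
    lines = ["Related Latin passages:"]
    lines += [f"- {e}" for e in exemplars]
    lines += [
        "",
        f"Latin source: {source}",
        f"Draft translation: {draft}",
        "Refine the draft so that it stays faithful to the source.",
    ]
    return "\n".join(lines)


def translate(
    source: str,
    drafter: Callable[[str], str],
    refiner: Callable[[str], str],
    corpus: List[str],
    k: int = 5,
) -> str:
    """Stage 1: draft. Stage 2: retrieve exemplars and refine the draft."""
    draft = drafter(source)
    exemplars = retrieve(source, corpus, k)
    return refiner(build_prompt(source, draft, exemplars))


if __name__ == "__main__":
    corpus = [
        "gallia est omnis divisa in partes tres",
        "arma virumque cano",
        "in principio erat verbum",
    ]
    # Stub models: the drafter returns a fixed draft, the refiner echoes its prompt.
    print(translate("arma virumque", lambda s: "arms and the man", lambda p: p, corpus, k=2))
```

Keeping the drafter and refiner as plain callables makes the retrieval and prompt-assembly steps testable without loading any model; a real setup would swap in model calls and a dense retriever over the non-parallel corpus.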