LATA: A Tool for LLM-Assisted Translation Annotation

Baorong Huang, Ali Asiri

TL;DR

This work tackles the challenge of building high-quality parallel translation corpora for structurally divergent language pairs, such as Arabic–English, by moving beyond simple sentence alignment to multi-layered annotation. It presents LATA, a desktop tool that uses a template-based Prompt Manager and large language models to perform sentence segmentation and alignment under strict JSON output constraints within a human-in-the-loop workflow. The translation annotation pipeline comprises Document Metadata Collection, Paragraph Alignment Annotation, and LLM-Assisted Sentence Segmentation and Annotation, producing CES-compliant XML outputs and enabling custom translation technique annotations. The approach balances automation efficiency with linguistic precision for complex translation phenomena, and the authors provide an MIT-licensed implementation with planned extensions to word-level annotation, a bilingual knowledge graph, and multimodal anchoring.

Abstract

The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic–English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to narrow the gap between scalable automation and the rigorous precision of expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing is integrated into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.
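
To make the idea of a template-based prompt with strict JSON output constraints more concrete, the following is a minimal sketch rather than LATA's actual implementation. The template wording, the placeholder names (`src_lang`, `tgt_lang`, `src_text`, `tgt_text`), the `alignments` JSON schema, and the helper functions `build_prompt` and `parse_alignment_response` are all assumptions made for illustration. The LLM call itself is omitted; only prompt filling and response validation are shown.

```python
import json
from string import Template

# Hypothetical prompt template; the placeholders are illustrative and are
# not taken from LATA's Prompt Manager.
PROMPT_TEMPLATE = Template(
    "Segment the $src_lang and $tgt_lang paragraphs into sentences and align them.\n"
    "Return only JSON of the form "
    '{"alignments": [{"source": ["..."], "target": ["..."]}]}.\n'
    "Source paragraph: $src_text\n"
    "Target paragraph: $tgt_text"
)


def build_prompt(src_lang: str, tgt_lang: str, src_text: str, tgt_text: str) -> str:
    """Fill the dynamic placeholders; raises KeyError if one is missing."""
    return PROMPT_TEMPLATE.substitute(
        src_lang=src_lang, tgt_lang=tgt_lang, src_text=src_text, tgt_text=tgt_text
    )


def parse_alignment_response(raw: str) -> list:
    """Accept the model output only if it matches the assumed JSON schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("alignments"), list):
        raise ValueError("expected an object with an 'alignments' list")
    for i, link in enumerate(data["alignments"]):
        if not isinstance(link, dict):
            raise ValueError(f"alignment {i} is not an object")
        for side in ("source", "target"):
            sentences = link.get(side)
            if not isinstance(sentences, list) or not all(
                isinstance(s, str) for s in sentences
            ):
                raise ValueError(f"alignment {i}: '{side}' must be a list of strings")
    return data["alignments"]
```

Rejecting malformed output with an exception, instead of silently repairing it, keeps the decision with the human annotator: in a human-in-the-loop workflow a failed parse can simply be retried or sent to manual alignment.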

Paper Structure (23 sections, 1 equation, 9 figures)

Figures (9)

  • Figure 1: Hierarchical Alignment Pipeline. The process begins with metadata collection from source and target documents, followed by paragraph-level alignment. An LLM layer provides automated sentence segmentation and alignment to support the transition to granular sentence-level annotation. The final output is a zip archive of three structured, CES-compliant XML files that contain alignment links and qualitative descriptions of translation techniques (see Appendix).
  • Figure 2: Template-based prompt configuration with dynamic placeholders
  • Figure 3: Translation annotation configuration with customized name, description, and examples
  • Figure 4: Dual-pane interface for capturing source and target document metadata
  • Figure 5: Manual alignment adjustment interface for post-editing paragraph or sentence alignments
  • ...and 4 more figures
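
The stand-off, CES-style alignment output mentioned in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: it follows the common `cesAlign` convention of `link` elements whose `xtargets` attribute lists source IDs and target IDs separated by a semicolon, while the function `build_ces_align`, the file names, and the use of a `type` attribute to carry a translation technique label are assumptions. LATA's actual schema is described in the paper's appendix and may differ.

```python
import xml.etree.ElementTree as ET


def build_ces_align(from_doc: str, to_doc: str, links) -> str:
    """Build a stand-off alignment document in a cesAlign-like layout.

    `links` is an iterable of (source_ids, target_ids, technique) tuples,
    where `technique` may be None. Sentence text stays in the source and
    target files; only IDs are referenced here (stand-off annotation).
    """
    root = ET.Element("cesAlign", {"version": "1.0"})
    group = ET.SubElement(
        root, "linkGrp", {"targType": "s", "fromDoc": from_doc, "toDoc": to_doc}
    )
    for src_ids, tgt_ids, technique in links:
        attrs = {"xtargets": " ".join(src_ids) + ";" + " ".join(tgt_ids)}
        if technique:
            # Assumption: the technique label is stored as a 'type' attribute.
            attrs["type"] = technique
        ET.SubElement(group, "link", attrs)
    return ET.tostring(root, encoding="unicode")


# Usage: one 1-to-2 link with a hypothetical technique label, one plain 1-to-1 link.
xml_text = build_ces_align(
    "source.xml",
    "target.xml",
    [(["s1"], ["t1", "t2"], "explicitation"), (["s2"], ["t3"], None)],
)
```

Because the links only point at sentence IDs, the alignment file can be edited or regenerated without touching the source and target documents.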