Toward domain-specific machine translation and quality estimation systems

Javad Pourmostafa Roshan Sharami

Abstract

Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
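The similarity-based data selection idea described for Chapter 2 can be illustrated with a minimal sketch. The dissertation's actual method is not reproduced here; this toy version uses simple bag-of-words vectors and cosine similarity, where a real system would use learned sentence embeddings. The function name `select_in_domain` and the top-`n` parameter are illustrative assumptions, loosely mirroring the "select up to $n$ parallel instances per in-domain sentence" setup shown in the figures below.

```python
import numpy as np
from collections import Counter

def bow_vector(sentence, vocab):
    # Term-frequency vector over a shared vocabulary (toy stand-in
    # for a learned sentence embedding).
    counts = Counter(sentence.lower().split())
    return np.array([counts[w] for w in vocab], dtype=float)

def select_in_domain(in_domain, out_of_domain, n=1):
    """For each in-domain sentence, pick the n most cosine-similar
    out-of-domain sentences; return the deduplicated sub-corpus."""
    vocab = sorted({w for s in in_domain + out_of_domain
                    for w in s.lower().split()})
    ood_vecs = np.array([bow_vector(s, vocab) for s in out_of_domain])
    ood_norms = np.linalg.norm(ood_vecs, axis=1) + 1e-12
    selected = set()
    for s in in_domain:
        v = bow_vector(s, vocab)
        sims = ood_vecs @ v / (ood_norms * (np.linalg.norm(v) + 1e-12))
        # Keep the indices of the top-n most similar sentences.
        selected.update(int(i) for i in np.argsort(-sims)[:n])
    return [out_of_domain[i] for i in sorted(selected)]

in_dom = ["clinical trial results", "patient dosage report"]
out_dom = ["weather forecast today", "clinical trial protocol",
           "dosage instructions for patients", "stock market news"]
subset = select_in_domain(in_dom, out_dom, n=1)
```

Here `subset` contains only the medical-domain sentences, illustrating how a small, targeted selection can be carved out of a larger generic corpus.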

Paper Structure

This dissertation comprises 154 sections, 3 equations, 12 figures, 29 tables, and 1 algorithm.

Figures (12)

  • Figure 1: An overview of the proposed methodology. The notation $1 .. n$ in the figure indicates that the algorithm can select between $1$ and $n$ sentences, where $n$ is an arbitrary number. For example, if $n = 5$, the algorithm selects up to $5$ parallel instances from the out-of-domain corpus for each in-domain sentence.
  • Figure 1: Pseudocode outlining the proposed Search Algorithm. Each phase of the methodology is annotated alongside the relevant code. Function arguments are omitted for simplicity. The first element of the returned list (temp) includes the selected prompt, its associated score, and the translated text.
  • Figure 2: An iteration of selecting in-domain data
  • Figure 3: The difference between the selected sub-corpora and the in-domain test sets
  • Figure 4: The effect of different batch sizes on training the proposed models. The left and right panels show validation accuracy and validation perplexity per step, respectively.
  • ...and 7 more figures