Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation

Bryan Li, Jiaming Luo, Eleftheria Briakou, Colin Cherry

TL;DR

This work systematically compares inference-time domain adaptation strategies for MT with LLMs along two axes: the knowledge source (retrieval versus generation) and the knowledge strategy (demonstrations versus terminology), evaluated on law, medical, and religious texts. The study finds that demonstrations are consistently more effective than terminology, and that retrieval consistently outperforms generation. Notably, generating domain-specific demonstrations can substantially boost weaker models, helping them approach larger models' zero-shot performance. Analyses reveal that much of the benefit from retrieved demonstrations comes from style alignment with the corpus rather than from explicit translations of domain terminology. The findings suggest practical, low-cost avenues for domain adaptation in MT with LLMs and highlight the potential of cross-LLM knowledge transfer to distill strengths from larger models into smaller ones. They also point to limitations of the evaluated dataset and of the resource assumptions, which constrain broader generalization.

Abstract

While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test-time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with larger models' zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.

Paper Structure

This paper contains 35 sections, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Illustration of the main MT settings, for an example source text in German. The two knowledge strategies are demonstrations vs. terminology; the two sources are retrieval vs. generation. This gives 4 settings for comparison. Within a strategy, we use the same prompts, varying only the provided information.
  • Figure 2: Illustration of our process to decompose the contributions of retrieved demonstrations into style and terminology. We first extract the source-target term pairs using a simple function, and aggregate them into a local terminology. Then, the remaining tokens are the style templates, with the terms masked. Note that in the actual data, we use <MASK> instead of [].
  • Figure 3: Results for zero-shot, external retrieval, terms from demonstrations, and style from demonstrations.
  • Figure 4: Prompt for zero-shot MT.
  • Figure 5: Prompt for MT with demonstrations (also known as few-shot MT in prior work). This prompt is used for both demonstration retrieval and demonstration generation.
  • ...and 8 more figures
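
The decomposition described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "simple function" used to extract term pairs is not specified here, so the hypothetical `extract_term_pairs` below stands in for it by looking up a small bilingual glossary with plain substring matching. The names `decompose` and `glossary`, and the example sentences, are likewise assumptions made for illustration.

```python
from typing import Dict, List, Tuple

MASK = "<MASK>"  # the caption notes that the actual data uses <MASK>

Demo = Tuple[str, str]  # (source sentence, target sentence)


def extract_term_pairs(src: str, tgt: str, glossary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for the paper's extraction function.

    Keeps a glossary entry only if the source term appears in the source
    sentence and its translation appears in the target sentence.
    Substring matching is crude (it ignores word boundaries and inflection).
    """
    return [(s, t) for s, t in glossary.items() if s in src and t in tgt]


def decompose(demos: List[Demo], glossary: Dict[str, str]) -> Tuple[Dict[str, str], List[Demo]]:
    """Split demonstrations into a local terminology and masked style templates."""
    terminology: Dict[str, str] = {}
    templates: List[Demo] = []
    for src, tgt in demos:
        pairs = extract_term_pairs(src, tgt, glossary)
        # Mask longer source terms first so a short term cannot break up a longer one.
        for s, t in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
            src = src.replace(s, MASK)
            tgt = tgt.replace(t, MASK)
            terminology[s] = t  # aggregate across all demonstrations
        templates.append((src, tgt))
    return terminology, templates


if __name__ == "__main__":
    demos = [
        ("Der Vertrag tritt am 1. Januar in Kraft.",
         "The contract enters into force on 1 January."),
    ]
    terminology, templates = decompose(demos, {"Vertrag": "contract"})
    print(terminology)
    print(templates)
```

Under this sketch, the aggregated `terminology` would feed a terminology-only prompt and the masked `templates` a style-only prompt, which mirrors the "terms from demonstrations" and "style from demonstrations" conditions named in the Figure 3 caption.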