Improving Rare Word Translation With Dictionaries and Attention Masking

Kenneth J. Sible, David Chiang

TL;DR

This paper proposes appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions, and finds that including definitions for rare words improves performance by up to 1.6 MacroF1.

Abstract

In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.

Paper Structure (24 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: We append definitions of kühn 'bold, brave' and Held 'hero' to a sentence, and use an attention mask (with learnable strength) to inform the model which definitions correspond to which words. In each picture, the query vectors are above, with one query vector shaded yellow, and the key/value vectors are below, shaded to indicate the strength of the attention mask (black = not masked, white = masked).
  • Figure 2: Our system uses two attention masks with learnable strengths. Rows are queries; columns are keys/values. Black = not masked; white = masked. Mask $\mathbf{M}^1$ allows each source word to attend to its definitions (if any). Mask $\mathbf{M}^2$ allows each definition word to attend to the word it defines.
  • Figure 3: The attention scores of the Masking model for the German sentence: "Mit seiner Tarnkappe schlich sich Siegfried aus der Burg" with the definition string "invisibility cloak crept slunk tiptoed Sigurd." The attention scores are summed for all encoder layers and attention heads. We observe both attention masks being utilized by the model.
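The two masks described in Figure 2 can be sketched as boolean matrices over the concatenated sequence of source words and definition words. The following is a minimal illustration, not the authors' implementation: the function name `build_masks`, the dictionary format, and the word-to-definition alignment for the Figure 3 sentence are assumptions made for this example, and whitespace tokenization stands in for whatever segmentation the paper uses. How the masks enter the attention computation with their learnable strengths is not shown here.

```python
import numpy as np


def build_masks(source, definitions):
    """Build the two masks for a source sentence with appended definitions.

    source:      list of source tokens.
    definitions: dict mapping a source token index to its list of
                 definition tokens (hypothetical format for this sketch).

    Returns (tokens, m1, m2). Rows are queries, columns are keys/values,
    and True means "not masked", following the Figure 2 convention.
    """
    tokens = list(source)
    spans = {}
    # Append each definition after the source sentence, remembering
    # which positions belong to which source word.
    for idx in sorted(definitions):
        start = len(tokens)
        tokens.extend(definitions[idx])
        spans[idx] = range(start, len(tokens))

    n = len(tokens)
    m1 = np.zeros((n, n), dtype=bool)  # source word -> its definition words
    m2 = np.zeros((n, n), dtype=bool)  # definition word -> the word it defines
    for idx, span in spans.items():
        for j in span:
            m1[idx, j] = True
            m2[j, idx] = True
    return tokens, m1, m2


# Sentence and definition string from Figure 3; the alignment of
# definition words to source words below is assumed for illustration.
source = "Mit seiner Tarnkappe schlich sich Siegfried aus der Burg".split()
definitions = {
    2: ["invisibility", "cloak"],        # Tarnkappe
    3: ["crept", "slunk", "tiptoed"],    # schlich
    5: ["Sigurd"],                       # Siegfried
}
tokens, m1, m2 = build_masks(source, definitions)
```

In this sketch the second mask is simply the transpose of the first, since each definition word points back to exactly the source word whose definition contains it.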