Improving Rare Word Translation With Dictionaries and Attention Masking

Kenneth J. Sible; David Chiang

TL;DR

This paper proposes appending definitions from a bilingual dictionary to source sentences and using attention masking to link rare words with their definitions. Including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.

Abstract

In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.

Paper Structure

This paper contains 24 sections, 2 equations, 3 figures, and 3 tables.

Introduction
Related Work
Dictionaries as translators
Dictionaries as text
Methodology
Headword Selection
Definition Retrieval
Attention Masking
Experiments
Translation Model
Training/Evaluation Data
Lemmatizer and Dictionary
Experimental Setup
Hyperparameter Search
Results
...and 9 more sections

Figures (3)

Figure 1: We append definitions of kühn 'bold, brave' and Held 'hero' to a sentence, and use an attention mask (with learnable strength) to inform the model which definitions correspond to which words. In each picture, the query vectors are above, with one query vector shaded yellow, and the key/value vectors are below, shaded to indicate the strength of the attention mask (black = not masked, white = masked).
Figure 2: Our system uses two attention masks with learnable strengths. Rows are queries; columns are keys/values. Black = not masked; white = masked. Mask $\mathbf{M}^1$ allows each source word to attend to its definitions (if any). Mask $\mathbf{M}^2$ allows each definition word to attend to the word it defines.
Figure 3: The attention scores of the Masking model for the German sentence: "Mit seiner Tarnkappe schlich sich Siegfried aus der Burg" with the definition string "invisibility cloak crept slunk tiptoed Sigurd." The attention scores are summed for all encoder layers and attention heads. We observe both attention masks being utilized by the model.
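
The captions for Figures 1 and 2 describe appending definitions to the source sentence and building two link matrices, $\mathbf{M}^1$ (each source word to its definitions) and $\mathbf{M}^2$ (each definition word to the word it defines). The following is a minimal sketch of that bookkeeping only, not the authors' implementation. The function names, the toy sentence, the toy dictionary, and the rare-word test are all hypothetical. Here `True` marks a linked (not masked) query/key pair. How the learnable strength is applied to the attention scores is not specified in this summary, so it is left out.

```python
def append_definitions(tokens, dictionary, is_rare):
    """Append definition words for rare tokens to the end of the sentence.

    Returns the extended token list and a list of (source_pos, definition_pos)
    links. The exact input format used in the paper (separators, ordering) is
    not given here, so plain concatenation is an assumption.
    """
    extended = list(tokens)
    links = []
    for src_pos, tok in enumerate(tokens):
        if is_rare(tok) and tok in dictionary:
            for def_word in dictionary[tok]:
                # The definition word will sit at the current end of the sequence.
                links.append((src_pos, len(extended)))
                extended.append(def_word)
    return extended, links


def build_masks(length, links):
    """Build boolean link matrices; rows are queries, columns are keys/values.

    m1[i][j] is True when source word i may attend to its definition word j.
    m2[j][i] is True when definition word j may attend to the word i it defines.
    """
    m1 = [[False] * length for _ in range(length)]
    m2 = [[False] * length for _ in range(length)]
    for src_pos, def_pos in links:
        m1[src_pos][def_pos] = True
        m2[def_pos][src_pos] = True
    return m1, m2


# Hypothetical toy input, loosely based on the words named in Figure 1.
tokens = ["ein", "kühn", "Held"]
toy_dictionary = {"kühn": ["bold", "brave"], "Held": ["hero"]}
rare_words = {"kühn", "Held"}

extended, links = append_definitions(
    tokens, toy_dictionary, lambda t: t in rare_words
)
m1, m2 = build_masks(len(extended), links)

for name, mask in (("M1", m1), ("M2", m2)):
    print(name)
    for row in mask:
        print("".join("#" if cell else "." for cell in row))
```

In this toy case the extended sequence has six positions: row 1 of `m1` (kühn) links to columns 3 and 4 (bold, brave), and row 5 of `m2` (hero) links back to column 2 (Held).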
