Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Alsu Sagirova, Mikhail Burtsev

TL;DR

The paper addresses the lack of explicit memory in standard Transformer architectures by introducing symbolic working memory in the decoder to store contextual knowledge as interpretable memory tokens. The proposed method interleaves memory tokens with target predictions, using a fixed memory size of $M=10$ and a final layer enlarged to $target\_vocabulary\_size+2$ to distinguish memory from target tokens, with memory reading via conventional attention. Across four Russian-to-English MT datasets of varying complexity, the approach improves translation quality and enables analysis of memory content, showing that memory stores keywords and content words, and that memory diversity correlates with task difficulty while shrinking with fine-tuning. Overall, the work demonstrates a neuro-symbolic, interpretable memory mechanism that enhances translation performance and provides insight into model decision-making, potentially aiding debugging and domain adaptation efforts.
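The interleaving of memory tokens with target predictions can be illustrated with a minimal sketch. It assumes the decoder output is a flat token sequence accompanied by binary flags (1 for a target token, 0 for a memory token) and that memory diversity is the count of unique memory tokens. The function names and the toy sequence are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def split_streams(
    tokens: Sequence[str], flags: Sequence[int]
) -> Tuple[List[str], List[str]]:
    """Separate an interleaved decoder output into target and memory tokens.

    Assumption: flags[i] == 1 marks a target token, flags[i] == 0 a memory token.
    """
    if len(tokens) != len(flags):
        raise ValueError("tokens and flags must have the same length")
    target = [tok for tok, flag in zip(tokens, flags) if flag == 1]
    memory = [tok for tok, flag in zip(tokens, flags) if flag == 0]
    return target, memory


def memory_diversity(memory: Sequence[str]) -> int:
    """Number of unique tokens written to working memory."""
    return len(set(memory))


# Toy example (hypothetical tokens, not real model output).
tokens = ["server", "server", "the", "server", "config", "is", "running"]
flags = [0, 0, 1, 1, 0, 1, 1]

target, memory = split_streams(tokens, flags)
print("target:", target)
print("memory:", memory)
print("diversity:", memory_diversity(memory))
```

In this toy case the memory stream repeats one token and adds a second, so its diversity is 2 even though three memory slots were written. A memory filled with a single repeating token would have diversity 1.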

Abstract

Even though Transformers are extensively used for Natural Language Processing tasks, especially for machine translation, they lack an explicit memory to store key concepts of processed texts. This paper explores the properties of the content of symbolic working memory added to the Transformer model decoder. Such working memory enhances the quality of model predictions in the machine translation task and works as a neural-symbolic representation of information that is important for the model to make correct translations. The study of memory content revealed that keywords of the translated text are stored in the working memory, pointing to the relevance of memory content to the processed text. Also, the diversity of tokens and parts of speech stored in memory correlates with the complexity of the corpora for the machine translation task.

Paper Structure

This paper contains 6 sections, 6 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Transformer with the working memory-augmented decoder. The decoder inputs are the tokens generated so far, $y_1,\dots,y_{t-1}$, and the corresponding memory flags $m_1,\dots,m_{t-1}$. The memory flag is a binary value: $m_i=1$ means that $y_i$ is a target prediction token, and $m_j=0$ means that $y_j$ is a working memory token. The final layer of the model has an expanded output size $= target\_vocabulary\_size+2$. The loss function scores only the difference between the target sequence predictions and the real targets; memory tokens do not contribute to it.
  • Figure 2: Distributions of the number of unique tokens stored in working memory for the TED, WSC, IT documents, and Open Subtitles datasets. The legend for the histograms before and after fine-tuning is shown in panel (b). Before fine-tuning, memory diversity for WSC, Open Subtitles, and IT documents was larger than for TED. After fine-tuning, working memory for all datasets was mostly filled with a single repeating token. So, while processing unseen data, the model exhibits higher variability of working memory content, and more complex datasets demonstrate higher memory diversity. After fine-tuning, the model was aligned with the data, and working memory contained more repetitive tokens.
  • Figure 3: Average working memory diversity measured after each epoch of fine-tuning. The dashed lines are linear least squares fits. The plot confirms that during fine-tuning, the working memory content becomes more uniform. The minimal number of unique memory tokens is larger for more complex texts (IT docs and WSC) than for simpler texts (Open Subtitles and TED).
  • Figure 4: Probabilities (with confidence intervals) of finding one or more (a) keywords extracted from the model predictions and (b) content words in working memory for all datasets. The difference in keyword probability is significant for the IT documents-TED pair before and after fine-tuning ($p<0.01$) and for the IT documents-Open Subtitles pair after fine-tuning ($p<0.0001$). Content word probabilities differ significantly for all complex-simple dataset pairs before and after fine-tuning ($p<0.01$).
  • Figure 5: Dependence of the average number of unique tokens in memory on the length of the model's predicted sequence (the dashed line shows a linear least squares fit). The average memory diversity does not depend significantly on the prediction length, either before or after fine-tuning.
  • ...and 1 more figures
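
The loss described in the Figure 1 caption, which scores target predictions but not memory tokens, can be sketched as a masked negative log-likelihood. This is a minimal illustration in plain Python, assuming per-position probability distributions and the flag convention from the caption ($m_i=1$ for a target token, $m_i=0$ for a memory token). The name `masked_nll`, its arguments, and the choice to average over target positions are assumptions made for illustration; the paper's implementation may differ.

```python
import math
from typing import Sequence


def masked_nll(
    probs: Sequence[Sequence[float]],
    gold: Sequence[int],
    flags: Sequence[int],
) -> float:
    """Mean negative log-likelihood over target positions only.

    probs[i] is the predicted distribution at position i,
    gold[i] is the reference token id at position i (ignored when flags[i] == 0),
    flags[i] is 1 for a target position and 0 for a memory position.
    """
    if not (len(probs) == len(gold) == len(flags)):
        raise ValueError("probs, gold and flags must have the same length")
    n_target = sum(1 for flag in flags if flag == 1)
    if n_target == 0:
        raise ValueError("no target positions to score")
    total = 0.0
    for dist, gold_id, flag in zip(probs, gold, flags):
        if flag == 1:  # memory positions (flag == 0) are skipped
            total -= math.log(dist[gold_id])
    return total / n_target


# Toy example over a two-token vocabulary (hypothetical numbers).
probs = [[0.9, 0.1], [0.5, 0.5], [0.25, 0.75]]
gold = [0, 1, 1]
flags = [0, 1, 1]  # the first position is a memory token
loss = masked_nll(probs, gold, flags)
print(loss)
```

Because memory positions are masked out, changing the distribution predicted at a memory position leaves the loss unchanged. Under this sketch, memory content is shaped only indirectly, through its effect on later target predictions.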