Memorizing Documents with Guidance in Large Language Models

Bumjin Park, Jaesik Choi

TL;DR

This work proposes a document-wise memory architecture that tracks document memories during training. With the trained document-wise memories, it yields distinct memory entries for different documents and high recall of document-related content in generation.

Abstract

Training data plays a pivotal role in AI models. Large language models (LLMs) are trained on massive numbers of documents, and their parameters hold document-related content. Recently, several studies have identified content-specific locations in LLMs by examining the parameters. Instead of such post hoc interpretation, we propose a document-wise memory architecture that tracks document memories during training. The proposed architecture maps document representations to memory entries, which softly mask memories in the forward process of LLMs. Additionally, we propose a document guidance loss, which increases the likelihood of a text under its own document memories and reduces its likelihood under the memories of other documents. Experimental results on Wikitext-103-v1 with Pythia-1B show that the proposed methods provide different memory entries for different documents and, with the trained document-wise memories, high recall of document-related content in generation.
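
The guidance idea in the abstract can be illustrated with a minimal sketch. It assumes a simple contrastive form: the negative log-likelihood of the text under its own document's memories, minus a weighted negative log-likelihood under another document's memories. The function names, the weight `alpha`, and this exact form are assumptions for illustration, not the loss defined in the paper.

```python
import numpy as np

def token_nll(logits, targets):
    """Mean negative log-likelihood of the target token ids under the logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def guidance_loss(logits_own, logits_other, targets, alpha=0.5):
    """Hypothetical guidance-style loss (assumed form, not the paper's).

    logits_own:   logits computed with the text's own document memories
    logits_other: logits computed with another document's memories
    """
    nll_own = token_nll(logits_own, targets)      # pushed down
    nll_other = token_nll(logits_other, targets)  # pushed up
    return nll_own - alpha * nll_other

# Toy check: two positions, vocabulary of three tokens.
targets = np.array([0, 1])
good = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])  # favors the targets
bad = np.array([[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])   # favors other tokens
loss_matched = guidance_loss(good, bad, targets)
loss_swapped = guidance_loss(bad, good, targets)
print(loss_matched < loss_swapped)
```

The subtracted term is unbounded in this form, so a practical version would likely clip or otherwise bound it; that choice is left open here.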

Paper Structure (21 sections, 1 theorem, 10 equations, 12 figures, 4 tables)

Key Result

Proposition 1

Let $\mathcal{K}_1, \mathcal{K}_2$ be two DocReps with $d_{DocRep}(\mathcal{K}_1,\mathcal{K}_2) \le \epsilon$. If the memory selection function $g$ is $\tau$-Lipschitz, then $d_{Key}(g(\mathcal{K}_1), g(\mathcal{K}_2)) \le \tau \epsilon$.
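
A small numerical check can make the bound concrete. The sketch below assumes Euclidean distance for both $d_{DocRep}$ and $d_{Key}$ and a linear map $g(\mathcal{K}) = W\mathcal{K}$, whose Lipschitz constant under the L2 norm is the spectral norm of $W$. The matrix `W`, the dimensions, and the helper name are hypothetical; the proposition itself does not require $g$ to be linear.

```python
import numpy as np

def lipschitz_gap(W, k1, k2):
    """Return (d_key, tau * d_docrep) for the linear map g(k) = W @ k."""
    tau = np.linalg.norm(W, ord=2)           # largest singular value of W
    d_docrep = np.linalg.norm(k1 - k2)       # Euclidean distance between DocReps
    d_key = np.linalg.norm(W @ k1 - W @ k2)  # distance between memory entries
    return d_key, tau * d_docrep

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 2))         # maps 2-D DocReps to 16 memory entries
k1 = rng.normal(size=2)
k2 = k1 + 0.01 * rng.normal(size=2)  # a nearby DocRep
d_key, bound = lipschitz_gap(W, k1, k2)
print(d_key <= bound + 1e-12)
```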

Figures (12)

  • Figure 1: A graphical illustration of document-wise memories. The blue and red vectors indicate memories for two documents. The hidden representation of the LLM selects memories (dark arrows), and document-wise entries filter memories for the recall of document contents. Here, only the third vector contributes to the inference.
  • Figure 2: (left) Ten randomly generated DocReps. (bottom) Memory selection with a DocRep. (right) The perplexity of three documents, each measured with memories selected from all DocReps in the 2-dimensional space (xy-plane). The three paraboloids are the perplexities of the three documents, each with a local minimum at its original document representation.
  • Figure 3: Graphical illustration of the document-wise memory. The conditional generation with a DocRep ensures the memory locations of the document. The token representation originally provides key $Key_\mathrm{Tok}$. The proposed architecture combines $Key_\mathrm{Tok}$ with $Key_\mathrm{Doc}$ by element-wise multiplication. This process can be interpreted as a soft masking of activations. The generation of $Key_\mathrm{Doc}$ could be nonlinear.
  • Figure 4: Graphical illustration of three metric spaces. Documents $\mathcal{D}_1$ and $\mathcal{D}_2$ are mapped to $\mathcal{K}_1$ and $\mathcal{K}_2$, respectively, by the function $f$. The two DocReps are then mapped to memory entries $K_1$ and $K_2$, respectively, by $g$. When the Lipschitz continuity assumption holds for $f$ and $g$, similarity between documents is preserved in their memory selections. This work focuses on learning the memory selection function $g$ by randomly generating DocReps. The continuity of $g$ affects memory entries. The right panel shows the memory selection of two documents.
  • Figure 5: Perplexity of three documents with memories selected from DocReps in 2D. The selected memories from zero negative DocRep (center) are encouraged to forget the document contents.
  • ...and 7 more figures
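
The soft masking described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the memory layer is treated as a key-value feed-forward block, a ReLU stands in for the model's actual activation, and $Key_\mathrm{Doc}$ is produced by a sigmoid over a linear map of the DocRep. All weight names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def doc_masked_memory(h, doc_rep, W_key, W_val, W_doc):
    """Combine token keys with document keys by element-wise multiplication."""
    key_tok = np.maximum(h @ W_key, 0.0)  # Key_Tok: token-driven activations (ReLU assumed)
    key_doc = sigmoid(doc_rep @ W_doc)    # Key_Doc: soft mask in (0, 1) (assumed form of g)
    key = key_tok * key_doc               # soft masking of activations
    return key @ W_val, key_tok, key_doc  # weighted sum of memory values

rng = np.random.default_rng(0)
d_model, n_mem, d_doc = 8, 32, 2
h = rng.normal(size=d_model)        # hidden representation of one token
doc_rep = rng.normal(size=d_doc)    # 2-D DocRep, as in the figures
W_key = rng.normal(size=(d_model, n_mem))
W_val = rng.normal(size=(n_mem, d_model))
W_doc = rng.normal(size=(d_doc, n_mem))
out, key_tok, key_doc = doc_masked_memory(h, doc_rep, W_key, W_val, W_doc)
print(out.shape)
```

Because every entry of the mask lies strictly between 0 and 1 in this sketch, the document key can only attenuate a token activation, never amplify or flip it.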

Theorems & Definitions (1)

  • Proposition 1: Lipschitz Continuity for Memory Selection