Training Dynamics of Parametric and In-Context Knowledge Utilization in Language Models

Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, Meeyoung Cha

TL;DR

Large language models rely on both parametric ($\mathrm{Acc}_{\mathrm{PKU}}$) and in-context ($\mathrm{Acc}_{\mathrm{ICKU}}$) knowledge, but arbitration strategies appear to emerge during pretraining rather than being fixed. The authors train decoder-only transformers on synthetic biographies under controlled corpus variants to reveal how repetition, inconsistency noise, and Zipfian distributions shape arbitration metrics such as $\mathrm{Pref}_{\mathrm{PK}}$, $\mathrm{Pref}_{\mathrm{ICK}}$, $\mathrm{Acc}_{\mathrm{PKU}}$, and $\mathrm{Acc}_{\mathrm{ICKU}}$. They show that intra-document repetition enables the co-emergence of both knowledge-utilization modes, that small inconsistency noise biases conflict resolution toward parametric knowledge, and that distributional skew preserves in-context use for unfamiliar entities. These findings challenge traditional data-cleaning norms and offer practical guidelines for designing pretraining data in retrieval-augmented settings to achieve harmonious knowledge arbitration.

Abstract

Large language models often encounter conflicts between in-context knowledge retrieved at inference time and parametric knowledge acquired during pretraining. Models that accept external knowledge uncritically are vulnerable to misinformation, whereas models that adhere rigidly to parametric knowledge fail to benefit from retrieval. Despite the widespread adoption of retrieval-augmented generation, we still lack a systematic understanding of what shapes knowledge-arbitration strategies during training. This gap risks producing pretrained models with undesirable arbitration behaviors and, consequently, wasting substantial computational resources after the pretraining budget has already been spent. To address this problem, we present the first controlled study of how training conditions influence models' use of in-context and parametric knowledge, and how they arbitrate between them. We train transformer-based language models on a synthetic biographies corpus while systematically controlling various conditions. Our experiments reveal that intra-document repetition of facts fosters the development of both parametric and in-context capabilities. Moreover, training on a corpus that contains inconsistent information or distributional skew encourages models to develop robust strategies for leveraging parametric and in-context knowledge. Rather than viewing these non-ideal properties as artifacts to remove, our results indicate that they are important for learning robust arbitration. These insights offer concrete, empirical guidance for pretraining models that harmoniously integrate parametric and in-context knowledge.
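The three corpus conditions named above (intra-document repetition, inconsistent information, and distributional skew) can be illustrated with a minimal sketch. This is not the authors' data pipeline: the templates, the function names (`zipf_weights`, `make_document`, `build_corpus`), the Zipf exponent, and the choice to corrupt only the second mention are all assumptions made for illustration.

```python
import random

# Hypothetical paraphrase templates; the actual corpus templates are not given here.
TEMPLATES = [
    "{name} studied {value}.",
    "{name} majored in {value}.",
    "The field of {name} was {value}.",
]

def zipf_weights(n_entities, exponent=1.0):
    """Unnormalized Zipfian weights: the entity at rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_entities + 1)]

def make_document(name, value, distractors, noise_rate, rng):
    """Build one document with two paraphrased mentions of the same fact.

    With probability noise_rate, the second mention states a different value,
    which makes the document internally inconsistent (assumed noise model).
    """
    first, second = rng.sample(TEMPLATES, 2)
    second_value = value
    if rng.random() < noise_rate:
        second_value = rng.choice([d for d in distractors if d != value])
    return (first.format(name=name, value=value) + " "
            + second.format(name=name, value=second_value))

def build_corpus(entities, n_docs, noise_rate=0.01, exponent=1.0, seed=0):
    """Sample entities with Zipfian frequency and emit one document per draw."""
    rng = random.Random(seed)
    names = list(entities)
    weights = zipf_weights(len(names), exponent)
    distractors = sorted(set(entities.values()))
    return [
        make_document(name, entities[name], distractors, noise_rate, rng)
        for name in rng.choices(names, weights=weights, k=n_docs)
    ]

# Toy entities (hypothetical); earlier entries receive higher sampling frequency.
entities = {"Alice Kim": "Physics", "Bob Lee": "Chemistry", "Cara Roe": "Biology"}
corpus = build_corpus(entities, n_docs=5, noise_rate=0.01)
```

Setting `noise_rate=0.0` gives a clean repeated corpus, while raising `exponent` makes the entity distribution more skewed.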

Paper Structure

This paper contains 28 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Three knowledge utilization scenarios. Left: parametric knowledge utilization where the model recalls knowledge encoded in its parameters and answers queries about entities seen during training. Middle: in-context knowledge utilization where the model extracts and uses knowledge provided only in the prompt and is evaluated on novel entities not seen during training. Right: knowledge conflict resolution where the model is queried about trained entities while the context provides conflicting information, and responses reveal the preference between parametric knowledge and in-context knowledge.
  • Figure 2: An example of intra-document repetition of key attributes (e.g., German, Physics) for a single entity, alongside our synthetic training-corpus variants. Single uses one paragraph per entity and thus encourages reliance on parametric knowledge; Repeated places two paraphrased paragraphs about the same entity in one document, allowing later mentions to leverage in-context knowledge or parametric knowledge.
  • Figure 3: Accuracy of parametric knowledge utilization ($\mathrm{Acc}_{\mathrm{PKU}}$) and in-context knowledge utilization ($\mathrm{Acc}_{\mathrm{ICKU}}$) across training steps. Left: The model trained on the Single corpus shows delayed parametric knowledge utilization and no activation of in-context knowledge utilization. Middle: The model trained on the Repeated corpus remains near random-guess performance on in-context knowledge utilization. Right: In contrast, the Repeated+Mix corpus induces early in-context knowledge utilization followed by parametric knowledge utilization.
  • Figure 4: (a) Training dynamics of $\mathrm{Acc}_{\mathrm{ICKU}}$, $\mathrm{Acc}_{\mathrm{PKU}}$, $\mathrm{Pref}_{\mathrm{ICK}}$, and $\mathrm{Pref}_{\mathrm{PK}}$ when trained on the Repeated+Mix corpus without noise (Left) and with 1% noise (Right). When the training corpus contains no noise (i.e., no inconsistent knowledge within the same documents), the model consistently prefers in-context knowledge in knowledge conflicts, whereas even a small amount of noise induces a phase shift toward parametric knowledge preference as parametric knowledge utilization stabilizes. (b) Changes in the layer-wise sum of attention mass at the last token of the test probe when the model trained with 1% noise performs in-context knowledge utilization. Green indicates the attention allocated to name tokens in the test probe, while blue indicates the attention allocated to target tokens in the context. (c) $\mathrm{Acc}_{\mathrm{ICKU}}$ at the end of training across different noise levels.
  • Figure 5: $\mathrm{Acc}_{\mathrm{PKU}}$, $\mathrm{Pref}_{\mathrm{ICK}}$, and $\mathrm{Pref}_{\mathrm{PK}}$ for the top 10% (high-frequency) and bottom 10% (low-frequency) entities in the training corpus. For high-frequency entities, $\mathrm{Pref}_{\mathrm{ICK}}$ is initially higher but gradually yields to $\mathrm{Pref}_{\mathrm{PK}}$; for low-frequency entities, $\mathrm{Pref}_{\mathrm{ICK}}$ remains consistently higher.
  • ...and 8 more figures
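
The four metrics that appear in these captions can be read off the three scenarios in Figure 1. The sketch below is one plausible way to compute them, assuming exact-match scoring on short attribute answers; the function names and the toy data are hypothetical, and the paper's own scoring procedure may differ.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that exactly match their targets.

    Used for Acc_PKU (trained entities, no context) and for
    Acc_ICKU (novel entities, answer given only in the prompt).
    """
    if not predictions:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(predictions)

def conflict_preferences(predictions, parametric_answers, context_answers):
    """Return (Pref_PK, Pref_ICK) on knowledge-conflict probes.

    Each probe asks about a trained entity while the context states a
    conflicting value. Pref_PK counts answers matching the trained value,
    Pref_ICK counts answers matching the context value. A prediction that
    matches neither counts toward neither preference.
    """
    n = len(predictions)
    if n == 0:
        return 0.0, 0.0
    pref_pk = sum(p == a for p, a in zip(predictions, parametric_answers)) / n
    pref_ick = sum(p == c for p, c in zip(predictions, context_answers)) / n
    return pref_pk, pref_ick

# Toy conflict probes (hypothetical data).
preds = ["German", "French", "Physics", "Korean"]
trained = ["German", "Italian", "Physics", "Japanese"]
in_context = ["French", "French", "Chemistry", "Chinese"]
pref_pk, pref_ick = conflict_preferences(preds, trained, in_context)
```

In this toy case two of four answers follow the trained value and one follows the context, so the two preferences need not sum to one.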