Task Matters: Knowledge Requirements Shape LLM Responses to Context-Memory Conflict

Kaiser Sun, Fan Bai, Mark Dredze

TL;DR

The paper investigates how context-memory conflict in LLMs depends on the knowledge demands of the task. It introduces a model-agnostic diagnostic framework that fixes parametric knowledge while manipulating task formulations and evidence plausibility to create task-specific conflict datasets. Across multiple open-source models, it finds that conflicts severely impair knowledge-intensive tasks, while simpler, knowledge-free tasks are less affected; strategies like rationales or reiteration can either help or harm depending on the task. The work also shows that LLMs used as evaluators can be biased by their own parametric knowledge, underscoring the need for task-aware evaluation and deployment that balances contextual and parametric knowledge. Overall, the study provides a unified, task-aware view of context-memory conflict, with practical implications for evaluation, and motivates dynamic, task-driven control of context and memory in LLM systems.

Abstract

Large language models (LLMs) draw on both contextual information and parametric memory, yet these sources can conflict. Prior studies have largely examined this issue in contextual question answering, implicitly assuming that tasks should rely on the provided context, leaving unclear how LLMs behave when tasks require different types and degrees of knowledge utilization. We address this gap with a model-agnostic diagnostic framework that holds underlying knowledge constant while introducing controlled conflicts across tasks with varying knowledge demands. Experiments on representative open-source LLMs show that performance degradation under conflict is driven by both task-specific knowledge reliance and conflict plausibility; that strategies such as rationales or context reiteration increase context reliance, helping context-only tasks but harming those requiring parametric knowledge; and that these effects bias model-based evaluation, calling into question the reliability of LLMs as judges. Overall, our findings reveal that context-memory conflict is inherently task-dependent and motivate task-aware approaches to balancing context and memory in LLM deployment and evaluation.

Paper Structure

This paper contains 42 sections, 17 figures, and 8 tables.

Figures (17)

  • Figure 1: Overview of the types of contexts and tasks in our evaluation. Context types vary in the level of conflict, while the tasks impose different knowledge constraints.
  • Figure 2: Overall diagnostic data creation flow. The lower portion is a zoom-in of the Evidence Creation step. After collecting the test model's parametric knowledge, the supporting passages are further edited to reveal multiple levels of conflict (2. Evidence Creation) and appear in different tasks (3. Task-Annotation).
  • Figure 3: Performance of each model on different task types. A clear trend of NC > HPC / LPC is shown across tasks involving knowledge utilization.
  • Figure 4: Performance on high plausibility contradiction instances with (HPCE) and without (HPC) explanations.
  • Figure 5: Averaged error distribution on RAG and PCK tasks. NC Only means that the model provides only the NC answer; PC Only means that the model provides only the PC answer; Both Wrong means that the model provides neither the PC nor the NC answer.
  • ...and 12 more figures