Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Avni Mittal

Abstract

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a lens inspired by prospective memory research in cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers, with no LLM-as-judge component, on publicly available datasets.

Paper Structure (84 sections, 1 equation, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Overview of the experimental pipeline. A verifiable formatting constraint from IFEval is composed with a benchmark task of varying difficulty (TriviaQA, MMLU, GSM8K, or CNN/DailyMail) using either the natural embedding template or the salience-enhanced reminder template. The model's response is then evaluated along two independent axes: (1) deterministic IFEval compliance checking (strict and loose) and (2) task-specific accuracy verification. This dual-evaluation design enables simultaneous measurement of prospective memory failure (compliance drop) and dual-task interference (accuracy drop).
  • Figure 2: Main result. IFEval compliance under increasing task complexity, with shaded 95% CIs from 3 independent runs. (a) With salience-enhanced prompt: compliance stays flat at 90-100%. (b) Natural embedding: compliance drops consistently as distraction difficulty increases. Both conditions share the same baseline (no additional task).
  • Figure 3: Forgetting deltas by distraction type and model, with 95% CI error bars from 3 runs. Positive values indicate compliance dropped vs. baseline. DeepSeek shows the largest forgetting; Llama is most robust.
  • Figure 4: Compliance gain from the salience-enhanced prompt format, by distraction type, with 95% CI error bars. The effect is largest for the long-context condition, CNN/DailyMail (CNN/DM), at +13 to +21%.
  • Figure 5: Instruction-type vulnerability heatmap. Each cell shows the forgetting delta (baseline compliance minus compliance under the natural embedding with GSM8K, in %). Rows are sorted by average vulnerability across models. Terminal and structural constraints cluster at the top (most forgotten); avoidance constraints cluster at the bottom (most robust).
  • ...and 4 more figures
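
The forgetting delta used in the captions above (baseline compliance minus compliance under task load) can be illustrated with a minimal sketch. The two checkers below are hypothetical stand-ins for a terminal constraint and an avoidance constraint; they are not IFEval's actual implementation, and the example responses are invented purely for illustration.

```python
from typing import Callable, Sequence

# Hypothetical closing phrase for a terminal constraint (an assumption for this sketch).
CLOSING_PHRASE = "Is there anything else?"


def ends_with_phrase(response: str) -> bool:
    """Terminal constraint: the response must end with the closing phrase."""
    return response.strip().endswith(CLOSING_PHRASE)


def avoids_commas(response: str) -> bool:
    """Avoidance constraint: the response must contain no commas."""
    return "," not in response


def compliance_rate(responses: Sequence[str], checker: Callable[[str], bool]) -> float:
    """Percentage of responses that pass a deterministic checker."""
    return 100.0 * sum(checker(r) for r in responses) / len(responses)


def forgetting_delta(
    baseline: Sequence[str],
    loaded: Sequence[str],
    checker: Callable[[str], bool],
) -> float:
    """Baseline compliance minus compliance under task load, in percentage points.

    Positive values mean compliance dropped relative to baseline.
    """
    return compliance_rate(baseline, checker) - compliance_rate(loaded, checker)


# Invented toy responses: constraint only (baseline) vs. constraint plus a task (loaded).
baseline_responses = [
    "Paris. Is there anything else?",
    "Blue. Is there anything else?",
]
loaded_responses = [
    "The answer is 42. Is there anything else?",
    "The answer is 7.",  # the closing phrase was forgotten
]

terminal_delta = forgetting_delta(baseline_responses, loaded_responses, ends_with_phrase)
avoidance_delta = forgetting_delta(baseline_responses, loaded_responses, avoids_commas)
print(terminal_delta, avoidance_delta)
```

In this toy data the terminal constraint loses half its compliance under load while the avoidance constraint is unaffected, which mirrors the direction of the pattern described for Figure 5 but says nothing about the paper's actual magnitudes.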