CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

Lukas Thede; Stefan Winzeck; Zeynep Akata; Jonathan Richard Schwarz

TL;DR

CapTrack is a capability-centric framework for analyzing forgetting in LLMs. It combines a behavioral taxonomy with an evaluation suite built on established benchmarks and targeted adaptations. The accompanying study finds that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors.

Abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use case of leveraging third-party pre-trained models. Forgetting is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite built on established benchmarks and targeted adaptations. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

Paper Structure (93 sections, 2 equations, 15 figures, 6 tables)

Introduction
Related Work
Post-Training of Large Language Models
Forgetting in Neural Networks and LLMs
CapTrack: Capability-Level Forgetting
CapTrack Taxonomy of LLM Capabilities
CapTrack Evaluation Suite
Experiments
Experimental Setup
Forgetting Capabilities Across Post-Training Stages
Capability-Specific Analysis
Effectiveness of Mitigating Forgetting
Data-Centric Mitigation: Domain-Specific vs. General Post-Training
Architectural Mitigation: Model Merging
Regularization-Based Mitigation: Parameter-Efficient Fine-Tuning
...and 78 more sections

Figures (15)

Figure 1: Average forgetting across post-training stages (relative to OOB; higher is better). Forgetting extends beyond factual knowledge, with strongest degradation under IFT, milder effects under DPO, and partial recovery when DPO is applied after IFT.
Figure 2: Capability-level forgetting profiles on the legal domain, aggregated across model sizes and shown per model family (Qwen, LLaMA, Gemma) for IFT and DPO. Further results for IFT+DPO and the medical domain are provided in the appendix. Radial distance indicates binned forgetting severity (see Section 4). Axes correspond to CapTrack sub-categories: CAN; C1 knowledge, C2 reasoning, C3 ICL, C4 faithfulness, C5a prompt robustness, C5b domain robustness, C5c multilingual; WILL; W1a unsafe refusal, W1b underspecified compliance, W2a coverage, W2b overreach, W3a verbosity, W3b formatting; HOW; H1 instruction following, H2 format fidelity, H3 tool use, H4 multi-turn consistency, H5 long-context, H6 citation. Results reveal strong cross-family differences, with distinct strengths and vulnerabilities across capability groups.
Figure 3: (Left) Forgetting differences under general (Tulu 3) vs. domain-specific (legal) post-training. (Right) Stability--plasticity trade-offs for regularized adaptation methods on the legal domain.
Figure 4: Pairwise correlation of LLM-as-a-judge scores across candidate judge models, computed separately for each benchmark requiring a judge. Scores are based on 100 responses per benchmark generated by the out-of-the-box Llama-3.3-70B-Instruct model. GPT-4o-mini shows consistently high correlation with strong frontier judges such as Claude Opus 4.1 and Gemini 2.5 Pro while offering substantially lower inference cost and latency, motivating its use as the default judge in CapTrack.
Figure 5: Extended spider plot results across model families. Left: Legal-domain results including DPO, IFT, and IFT+DPO. Right: Corresponding results for the medical domain. Each spider plot shows results averaged across model families for CAN (latent competence), WILL (default behavioral preferences), and HOW (protocol compliance), with radial distance indicating increasing forgetting. These figures complement the main paper spider plots by showing the combined post-training setting and domain-specific effects.
...and 10 more figures
