Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

Zilun Zhang; Yutao Sun; Tiancheng Zhao; Leigang Sha; Ruochen Xu; Kyusong Lee; Jianwei Yin

TL;DR

This work tackles catastrophic forgetting in LLMs and MLLMs during domain adaptation by introducing model-agnostic Tree Generation (TG), a self-decompression approach that dumps knowledge from an LLM into a training corpus. The TG framework, and its TG-SFT variant for supervised fine-tuning, constructs a tree-structured dialogue corpus via layered recursive prompts and semantic deduplication, mitigating forgetting while preserving general capabilities. Key contributions include a formalization of TG with tunable parameters $N_i$ and $L_i$, the introduction of Wide-Tree and Balance-Tree variants, and evidence that TG-SFT can restore LLM benchmark performance to a level comparable to that achieved with human data and can approach LLaVA Full-Param baselines on MLLMs. The approach extends to post-pretraining through TG-PT, where decompressed corpora can outperform random data, highlighting TG’s potential for knowledge distillation, continual learning, and broader model generalization across tasks and domains.

Abstract

Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, Multimodal Large Language Models (MLLMs), which are composed of an LLM base and a visual projector (e.g., LLaVA), show a significant decline in performance on language benchmarks compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses the knowledge within an LLM into a training corpus. This paper focuses on TG-SFT, which synthetically generates SFT data for the instruction-tuning step. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
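
The layered, recursive expansion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `generate` callable, the `branching` list, and the token-level Jaccard filter (a simple stand-in for the paper's semantic deduplication) are all assumptions made for illustration. Odd-numbered layers hold questions and even-numbered layers hold answers.

```python
from typing import Callable, List

Dialogue = List[str]  # alternating question, answer, question, ...


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dedup(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    for text in texts:
        if all(jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


def tree_generate(
    generate: Callable[[Dialogue, int], List[str]],
    branching: List[int],
) -> List[Dialogue]:
    """Expand a dialogue tree layer by layer.

    branching[i] is the (hypothetical) number of children requested per node
    at layer i + 1. Each root-to-leaf path becomes one dialogue.
    """
    paths: List[Dialogue] = [[]]
    for n_children in branching:
        next_paths: List[Dialogue] = []
        for path in paths:
            # The model sees the whole path so far and proposes continuations.
            candidates = dedup(generate(path, n_children))
            next_paths.extend(path + [c] for c in candidates)
        paths = next_paths
    return paths


def toy_model(history: Dialogue, n: int) -> List[str]:
    """Deterministic placeholder for an LLM call (assumption, not a real API)."""
    role = "Q" if len(history) % 2 == 0 else "A"
    return [f"{role}{len(history) + 1} option {i}" for i in range(n)]
```

For example, `tree_generate(toy_model, [3, 1, 2, 1])` requests three root questions, one answer each, two follow-up questions per answer, and one answer each, giving six four-turn dialogues. In a real pipeline `toy_model` would be replaced by sampling from the LLM being decompressed.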

Paper Structure (36 sections, 6 equations, 5 figures, 3 tables)

This paper contains 36 sections, 6 equations, 5 figures, 3 tables.

Introduction
Related Work
Methods for Preventing Catastrophic Forgetting
Synthetic Data for LLM Training
Data Extraction from LLMs and LLM Self-Iteration
Methodology
Initialization and Layered Expansion
Recursive Dialogue Generation
Corpus Construction
Structural Features of TG-SFT
Experiments
Experimental Settings
Models.
Data.
Evaluation.
...and 21 more sections

Figures (5)

Figure 1: Motivation of our work. Shaded regions represent error bars. SFT of an MLLM harms the language ability of its LLM backbone: the MLLM begins to forget its general language ability as training proceeds. We use LLaMA2-7B-chat as the LLM backbone for these experiments; details can be found in Appendix \ref{appendix:figure1}. The first data point is evaluated at the 3000-step checkpoint. We average the results of 10 MLLM benchmarks and 6 LLM benchmarks separately and normalize each by the result at the first checkpoint to show the trend (a score greater than 1 indicates improved performance; a score less than 1 indicates decreased performance).
Figure 2: TG-SFT structure overview, illustrating a three-layer complete tree. In practice, the depth of different leaf nodes can be adjusted as needed. This figure depicts a typical Balance-Tree, whereas in a Wide-Tree no further branching occurs beyond the second layer. Starting from the first layer, all odd-numbered layers serve as question layers and all even-numbered layers serve as answer layers.
Figure 3: Example of concatenated prompts. This figure uses the Llama2-chat model as the backbone LLM. In Llama2, the system prompt is enclosed with "<<SYS>>"; "[INST]" indicates the start of an instruction, signifying the beginning of generation in the user role; and "[/INST]" marks the end of the instruction. LLMs are trained to start responding from this point in the pre-training phase.
Figure 4: t-SNE visualization of the corpus generated using TG-SFT and the corpus collected from ShareGPT
Figure 5: Number of turns in TG-SFT decompressed data
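
The prompt concatenation described in the Figure 3 caption can be sketched as follows. This is a hedged illustration based on the publicly documented Llama 2 chat format, not the authors' exact template, which appears only in the figure itself. The function name `build_prompt` and its arguments are hypothetical. A prompt left open after "[INST]" invites the model to continue in the user role (a question layer), while a prompt ending in "[/INST]" invites an answer.

```python
from typing import List

# Delimiters from the public Llama 2 chat format. BOS/EOS special tokens are
# normally added by the tokenizer and are omitted here.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(system: str, history: List[str], next_role: str) -> str:
    """Concatenate a dialogue into a Llama-2-style prompt string.

    history alternates user, assistant, user, ...
    next_role is "user" (generate a question) or "assistant" (generate an answer).
    """
    if next_role not in ("user", "assistant"):
        raise ValueError("next_role must be 'user' or 'assistant'")
    if next_role == "assistant" and len(history) % 2 == 0:
        raise ValueError("an answer requires the history to end with a user turn")
    if next_role == "user" and len(history) % 2 == 1:
        raise ValueError("a question requires the history to end with an answer")

    parts: List[str] = []
    for i in range(0, len(history), 2):
        user = history[i].strip()
        if i == 0:
            # The system prompt is folded into the first user turn.
            user = B_SYS + system + E_SYS + user
        segment = f"{B_INST} {user} {E_INST}"
        if i + 1 < len(history):
            segment += f" {history[i + 1].strip()} "
        parts.append(segment)
    prompt = "".join(parts)

    if next_role == "user":
        # Leave the instruction open so the model writes the next question.
        prompt = f"{B_INST} {B_SYS}{system}{E_SYS}" if not history else prompt + B_INST
    return prompt
```

Calling `build_prompt("Be helpful.", ["Hi"], "assistant")` yields a prompt that ends in "[/INST]", while passing a completed question-answer pair with `next_role="user"` yields one that ends in an open "[INST]".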
