Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

Zilun Zhang, Yutao Sun, Tiancheng Zhao, Leigang Sha, Ruochen Xu, Kyusong Lee, Jianwei Yin

TL;DR

This work tackles catastrophic forgetting in LLMs and MLLMs during domain adaptation by introducing model-agnostic Tree Generation (TG), a self-decompression approach that dumps knowledge from an LLM into a training corpus. The TG framework, and its TG-SFT variant for supervised fine-tuning, constructs a structured, tree-formed dialogue corpus via layered recursive prompts and semantic deduplication, mitigating forgetting while preserving general capabilities. Key contributions include a formalization of TG with tunable parameters $N_i$ and $L_i$, the introduction of Wide-Tree and Balance-Tree variants, and evidence that TG-SFT can restore LLM benchmark performance to a level comparable to that obtained with human data and can approach LLaVA Full-Param baselines on MLLMs. The approach extends to post-pretraining through TG-PT, where decompressed corpora can outperform random data, highlighting TG’s potential for knowledge distillation, continual learning, and broader model generalization across tasks and domains.

Abstract

Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs), which are composed of an LLM base and a visual projector (e.g., LLaVA), a significant decline in performance on language benchmarks has been observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into a training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.

Paper Structure (36 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Motivation of our work. The shaded region represents the error bar. SFT of an MLLM harms the language ability of its LLM backbone: the MLLM begins to forget its general language ability as training progresses. We use the LLaMA2-7B-chat model as the LLM backbone for the experiments. Details of this experiment can be found in Appendix \ref{appendix:figure1}. The first data point is evaluated at the 3000-step checkpoint. We average the results of 10 MLLM benchmarks and 6 LLM benchmarks separately and normalize them by the result of the first checkpoint to show the trend (a score greater than 1 indicates improved performance; a score less than 1 indicates degraded performance).
  • Figure 2: Overview of the TG-SFT structure, illustrating a three-layer complete tree. In practice, the depth of different leaf nodes can be adjusted as needed. This figure depicts a typical Balance-Tree; in a Wide-Tree, no further branching occurs beyond the second layer. Starting from the first layer, all odd-numbered layers are question layers and all even-numbered layers are answer layers.
  • Figure 3: Example of concatenated prompts. This figure uses the Llama2-chat model as the backbone LLM. In Llama2, the system prompt is enclosed in "<<SYS>>" tags; "[INST]" marks the start of an instruction, signifying the beginning of generation in the user role, and "[/INST]" marks the end of the instruction. LLMs are trained to start responding from this point in the pre-training phase.
  • Figure 4: t-SNE visualization of the corpus generated using TG-SFT and the corpus collected from ShareGPT.
  • Figure 5: Number of turns in TG-SFT decompressed data.
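
The captions for Figures 2 and 3 describe a tree whose odd-numbered layers hold questions and whose even-numbered layers hold answers, and a prompt formed by concatenating the turns along a path. The sketch below is one possible way to represent that, not the authors' implementation. The names `Node` and `build_prompt` are hypothetical, and the token layout follows the commonly documented Llama 2 chat template, which is assumed (not confirmed by this page) to match what Figure 3 shows.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Llama 2 chat markers (assumed to match the template used in Figure 3).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


@dataclass(eq=False)
class Node:
    """Hypothetical tree node: a question (odd layer) or an answer (even layer)."""
    text: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, text: str) -> "Node":
        child = Node(text=text, parent=self)
        self.children.append(child)
        return child

    def path(self) -> List["Node"]:
        """Return the nodes from the first layer down to this node."""
        nodes, node = [], self
        while node is not None:
            nodes.append(node)
            node = node.parent
        return nodes[::-1]

    @property
    def layer(self) -> int:
        # Layers are 1-indexed: a node without a parent sits in layer 1.
        return len(self.path())

    @property
    def is_question(self) -> bool:
        return self.layer % 2 == 1


def build_prompt(turns: List[str], system_prompt: str) -> str:
    """Concatenate alternating question/answer turns into one prompt.

    If the path ends on a question, the prompt ends with "[/INST]" so the
    model continues in the assistant role. If it ends on an answer (or is
    empty), the prompt is left open after "[INST]" so the model continues
    in the user role, i.e. it writes the next question.
    """
    prompt = f"<s>{B_INST} {B_SYS}{system_prompt}{E_SYS}"
    for i, text in enumerate(turns):
        if i % 2 == 0:  # question turn
            if i > 0:
                prompt += f"<s>{B_INST} "
            prompt += f"{text} {E_INST}"
        else:  # answer turn
            prompt += f" {text} </s>"
    if turns and len(turns) % 2 == 0:
        prompt += f"<s>{B_INST}"
    return prompt


# Usage: a three-layer path (question, answer, follow-up question).
q1 = Node("What is catastrophic forgetting?")
a1 = q1.add_child("It is the loss of earlier knowledge after new training.")
q2 = a1.add_child("How can it be reduced?")
prompt = build_prompt([n.text for n in q2.path()], "You are a helpful assistant.")
```

Under these assumptions, a Wide-Tree would simply stop calling `add_child` below the second layer, while a Balance-Tree would keep branching at deeper layers. How many children each layer receives is left out here because this page does not define the parameters that control it.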