MUSCLE: A Model Update Strategy for Compatible LLM Evolution

Jessica Echterhoff; Fartash Faghri; Raviteja Vemulapalli; Ting-Yao Hu; Chun-Liang Li; Oncel Tuzel; Hadi Pouransari

TL;DR

When pretrained LLM base models are updated, fine-tuned user-facing downstream task adapters experience negative flips: instances that were previously predicted correctly are now predicted incorrectly, even when the downstream task training procedure remains identical.

Abstract

Large Language Models (LLMs) are regularly updated to enhance performance, typically through changes in data or architecture. During the update process, developers often prioritize improving overall performance metrics and pay less attention to maintaining compatibility with earlier model versions. Instance-level degradation of performance from one model version to the next (instance regression) can interfere with a user's mental model of the capabilities of a particular language model. Having to adapt that mental model with every update can lead to user dissatisfaction, especially when the new model has degraded compared to a prior version for a known use case (model update regression). We find that when pretrained LLM base models are updated, fine-tuned user-facing downstream task adapters experience negative flips: previously correct instances are now predicted incorrectly. We observe model update regression between different model versions on a diverse set of tasks and models, even when the downstream task training procedures remain identical. We argue for the importance of maintaining model update compatibility during updates, and present evaluation metrics designed specifically for generative tasks that are also applicable to discriminative tasks. We propose a training strategy to minimize the extent of instance regression in model updates, which involves training a compatibility adapter that can enhance task fine-tuned language models. We show that our proposed method reduces negative flips by up to 40%, e.g., when updating Llama 1 to Llama 2.
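
To make the notion of a negative flip concrete, the following is a minimal sketch of how a negative flip rate could be computed from per-instance correctness (for example, exact match) of an old and a new model on the same evaluation set. The function name `negative_flip_rate` and the toy data are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def negative_flip_rate(old_correct: Sequence[bool],
                       new_correct: Sequence[bool]) -> float:
    """Fraction of instances the old model got right and the new model gets wrong.

    Hypothetical helper for illustration; both sequences must refer to the
    same evaluation instances in the same order.
    """
    if len(old_correct) != len(new_correct):
        raise ValueError("old and new results must cover the same instances")
    if len(old_correct) == 0:
        raise ValueError("at least one instance is required")
    # A negative flip: correct before the update, incorrect after it.
    flips = sum(1 for o, n in zip(old_correct, new_correct) if o and not n)
    return flips / len(old_correct)


# Toy data (assumed): five instances scored before and after an update.
old = [True, True, False, True, False]
new = [True, False, True, True, False]
nfr = negative_flip_rate(old, new)
print(f"negative flip rate: {nfr:.2f}")
```

In this toy example both models are correct on three of five instances, so aggregate accuracy is unchanged, yet one instance regresses. This is the kind of instance-level change that an aggregate metric hides.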

Paper Structure (45 sections, 16 equations, 5 figures, 10 tables)

Introduction
Related Work
Measuring Model Update Regression
Classification
Reducing Model Update Regression
Model Ensembles
Knowledge Distillation
Problem Formulation
Setup
Backward Compatibility Metrics
Unobserved Inconsistencies
Continuous Metrics
Extended Evaluation Metrics
Accounting for Flips when Both Models are Incorrect
Smooth Compatibility Metrics
...and 30 more sections

Figures (5)

Figure 1: A real example of a model update that introduces instance regression (negative flip, where a previously correct prediction becomes incorrect) (top). With our model update strategy using a compatibility adapter approach, we enhance model update compatibility to the previous model while maintaining the overall performance gain (e.g. measured by the ROUGE-1 score for the summarization task) of the model update (bottom).
Figure 2: Four possibilities arise for each sample when a model is updated. Quadrants 2 and 4 show positive and negative flips, respectively. Quadrant 3 corresponds to instances where both models are incorrect. Encouraging similarity between the old and new models in this case (i.e., making the same mistakes) results in a more seamless model update from the user's perspective.
Figure 3: When updating a model, regression on individual tokens and instances can arise. We use a masked approach to select tokens to be aligned with knowledge distillation either with the old version to remain consistent or with the new task model to increase performance.
Figure 4: When updating LLMs (e.g. Llama 1 $\rightarrow$ Llama 2), we observe negative flips in different tasks. The smaller the performance gap from an old model to a new model, the more negative flips we observe. We indicate the performance gap by the difference in exact match for GSM8K, ROUGE-1 for SAMSum, and log-likelihood-based accuracy for PIQA and HellaSwag. When evaluating continuous metrics with the absolute ROUGE-1 value for summarization on SAMSum, we observe a large fraction of negative flips. We show the exact models analyzed in Table \ref{tab:mp}.
Figure 5: Comparison of the NFR and NFR$_{mc}$ metrics for evaluating inconsistency when updating LLMs on the HellaSwag task. We see that using our compatibility adapter (denoted by $_c$), we can reduce inconsistency for Llama and Vicuna models.
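
The four per-sample outcomes described in the Figure 2 caption, and the continuous-metric view mentioned in the Figure 4 caption, can be sketched roughly as follows. This is an illustrative sketch only: the function names, the outcome labels, and the rule that any decrease in a per-instance score (such as ROUGE-1) counts as a negative flip are assumptions, not definitions taken from the paper.

```python
from collections import Counter
from typing import Sequence

# Outcome labels keyed by (old model correct, new model correct).
# The label strings are hypothetical names chosen for this sketch.
_OUTCOMES = {
    (True, True): "both_correct",
    (False, True): "positive_flip",
    (False, False): "both_incorrect",
    (True, False): "negative_flip",
}


def flip_outcomes(old_correct: Sequence[bool],
                  new_correct: Sequence[bool]) -> Counter:
    """Tally the four possible per-sample outcomes of a model update."""
    if len(old_correct) != len(new_correct):
        raise ValueError("old and new results must cover the same instances")
    return Counter(_OUTCOMES[(bool(o), bool(n))]
                   for o, n in zip(old_correct, new_correct))


def continuous_negative_flip_rate(old_scores: Sequence[float],
                                  new_scores: Sequence[float],
                                  tol: float = 0.0) -> float:
    """Fraction of instances whose score drops by more than `tol`.

    Assumed rule for illustration: with a continuous per-instance metric,
    a score decrease after the update is treated as a negative flip.
    """
    if len(old_scores) != len(new_scores) or len(old_scores) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    drops = sum(1 for o, n in zip(old_scores, new_scores) if n < o - tol)
    return drops / len(old_scores)


# Toy data (assumed): one instance per outcome, then three per-instance scores.
counts = flip_outcomes([True, True, False, False], [True, False, True, False])
drop_rate = continuous_negative_flip_rate([0.40, 0.55, 0.30], [0.45, 0.50, 0.30])
print(dict(counts), round(drop_rate, 3))
```

The `tol` parameter is a design choice of this sketch: with a continuous score, tiny decreases may not be meaningful to a user, so a threshold lets the caller decide how large a drop has to be before it is counted.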
