LEWIS (LayEr WIse Sparsity) -- A Training Free Guided Model Merging Approach

Hetarth Chopra, Vidhi Rambhia, Vikram Adve

TL;DR

LEWIS improves existing model merging methods by adding activation-based, layer-wise sparsity guidance derived from a calibration dataset. By computing per-layer importance from activation norms and constraining the resulting per-layer sparsity within bounds, LEWIS preserves essential task-specific knowledge while merging fine-tuned models with methods like TIES and DARE. Empirical results on code instruction-following and math solving show consistent improvements, including up to 11.3% FE and 11.2% SM on GSM8K-related tasks, and the gains hold across merging frameworks. This work enables more targeted merging that enhances task specialization without retraining, using only a calibration dataset, with potential extension to broader domains and architectures.

Abstract

As specialized large language models (LLMs) become increasingly prevalent, model merging methods are being used to combine them into a single multi-task model without requiring any additional data or training. However, these approaches fall short when the objective of merging is to increase the downstream model's performance on a particular task-specific benchmark. In this work, we propose LEWIS (LayEr WIse Sparsity), a guided model-merging framework that uses activation-based layer importance to dynamically adjust the layer-wise task-vector sparsity used in the merge process. LEWIS uses a calibration dataset to prioritize critical layers during the task-vector pruning step of model merging. This approach guides existing merging methods by preserving essential layer-wise task-specific knowledge while steering the merged model toward the best performance on benchmarks resembling the calibration dataset. Our experiments demonstrate the effectiveness of LEWIS: it improves the performance of merged code instruction-following and math-solving models by up to 4 percent and 11.3 percent, respectively, outperforming unguided, data-free model merging approaches that use uniform sparsity.
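The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the importance score (L2 norm of the activation difference between the base and fine-tuned model on calibration inputs), the min-max normalization, the bound values `s_min` and `s_max`, and all function names are assumptions made for illustration. The pruning step is a magnitude-based trim in the style of TIES; DARE would instead drop entries at random and rescale the survivors.

```python
import math

def layer_importance(base_acts, tuned_acts):
    """Assumed score: L2 norm of (fine-tuned - base) activations per layer."""
    return [
        math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))
        for base, tuned in zip(base_acts, tuned_acts)
    ]

def sparsity_from_importance(scores, s_min=0.5, s_max=0.9):
    """Map importance to sparsity in [s_min, s_max]; important layers are pruned less."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All layers equally important: fall back to a uniform sparsity.
        return [(s_min + s_max) / 2] * len(scores)
    return [s_max - (s - lo) / (hi - lo) * (s_max - s_min) for s in scores]

def magnitude_prune(task_vector, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of task-vector entries."""
    k = int(round(sparsity * len(task_vector)))
    order = sorted(range(len(task_vector)), key=lambda i: abs(task_vector[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else v for i, v in enumerate(task_vector)]

# Toy calibration activations for three layers (hypothetical numbers).
base_acts = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
tuned_acts = [[1.0, 2.0], [1.5, 0.5], [3.0, 4.0]]

scores = layer_importance(base_acts, tuned_acts)
sparsities = sparsity_from_importance(scores)

# Prune one layer's task vector (fine-tuned weights minus base weights).
pruned = magnitude_prune([0.1, -0.5, 0.05, 0.3], 0.5)
```

In this toy setup the layer whose activations change most after fine-tuning receives the lowest sparsity (`s_min`), and the unchanged layer receives the highest (`s_max`), in contrast to a uniform-sparsity baseline that prunes every layer equally.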

Paper Structure

This paper contains 13 sections, 2 equations, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Process flow of the LEWIS framework. We show an example of how a calibration dataset (containing coding problems) can be used to compute layer-wise importance for a baseline LLM and its fine-tunes, enabling selective task-vector pruning and merging so that the merged model performs best on benchmarks containing coding problems.