1bit-Merging: Dynamic Quantized Merging for Large Language Models

Shuqi Liu, Yuxuan Yao, Bowei He, Zehua Liu, Xiongwei Han, Mingxuan Yuan, Han Wu, Linqi Song

TL;DR

1bit-Merging tackles the challenge of combining specialized LLMs by marrying task-specific routing with 1-bit quantized task vectors, enabling dynamic, task-aware merging with reduced storage. The approach leverages module-level knowledge locality, quantizing task vectors on a per-module basis and selecting the most relevant vector via a lightweight router before fusing the rest with a principled merging method. Empirical results across LLaMA2 and Mistral families show that 1bit-Merging matches or exceeds traditional merging while lowering storage costs, and scales effectively to larger architectures. The work highlights practical pathways for deploying composite, domain-competent models with manageable footprint and robust cross-domain performance.

Abstract

Recent advances in large language models have led to specialized models excelling in specific domains, creating a need for efficient model merging techniques. While traditional merging approaches combine parameters into a single static model, they often compromise task-specific performance. Task-specific routing methods, by contrast, maintain accuracy but introduce substantial storage overhead. We present 1bit-Merging, a novel framework that integrates task-specific routing with 1-bit quantized task vectors to balance performance and storage efficiency. Our approach leverages the observation that different task-specific models store knowledge in distinct layers (chat models primarily in attention layers, math and code models primarily in MLP layers), which enables targeted compression strategies. Through extensive experiments with LLaMA2 and Mistral model families across chat, mathematical reasoning, and code generation tasks, we demonstrate that 1bit-Merging achieves comparable or superior performance to existing methods while significantly reducing storage requirements. Our framework offers a practical solution for combining specialized models while maintaining their individual strengths and addressing the storage challenges of current approaches.
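To make the idea of a 1-bit quantized task vector concrete, the following is a minimal sketch in plain Python. It assumes a task vector is the element-wise difference between fine-tuned and base weights, and that each entry is replaced by its sign times one shared scale. The choice of scale (mean absolute value) and all function names are assumptions made for illustration; the abstract does not specify them, and the paper applies quantization per module rather than to a flat list.

```python
from typing import List, Tuple


def task_vector(finetuned: List[float], base: List[float]) -> List[float]:
    """Element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def quantize_1bit(tau: List[float]) -> Tuple[List[int], float]:
    """Keep one sign bit per entry plus a single shared scale.

    ASSUMPTION: the scale is the mean absolute value of the task vector.
    This is an illustrative choice, not confirmed by the source.
    """
    signs = [1 if t >= 0 else -1 for t in tau]
    scale = sum(abs(t) for t in tau) / len(tau)
    return signs, scale


def apply_quantized(base: List[float], signs: List[int], scale: float) -> List[float]:
    """Reconstruct approximate task-specific weights from the base model."""
    return [b + s * scale for b, s in zip(base, signs)]


# Toy weights standing in for one module (hypothetical values).
base = [0.10, -0.20, 0.30, 0.00]
finetuned = [0.14, -0.26, 0.28, 0.04]

tau = task_vector(finetuned, base)
signs, scale = quantize_1bit(tau)
approx = apply_quantized(base, signs, scale)

print(signs)   # one sign per weight
print(scale)   # shared magnitude
print(approx)  # approximate fine-tuned weights
```

In this sketch only the sign pattern and one floating-point scale need to be stored per task, which is where the storage saving over keeping every fine-tuned model in full precision would come from.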

Paper Structure

This paper contains 34 sections, 9 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: While individually fine-tuned models excel only in their specialized domains, our 1bit-Merging achieves superior performance across all domains.
  • Figure 2: Performance comparison of different strategies on GSM8K for the LLaMA2-7B series. TIES-Merging achieves superior performance when the Math fine-tuned (FT) model is chosen as the base model.
  • Figure 3: Performance vs. storage trade-offs for Mistral 7B deployment. Task-specific routing (Routing) achieves strong performance but requires full parameter storage. Model merging reduces storage at the cost of performance. 1bit-Merging strikes a better balance, and applying 1-bit quantization to all linear layers in Chat ("w/ Chat on Linear") further improves efficiency.
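The storage trade-off described in Figure 3 can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's accounting: it assumes a nominal 7 billion parameters, fp16 storage (2 bytes per parameter), three task-specific models (chat, math, code), and it ignores scales, the router, and any layers left unquantized. All names are hypothetical.

```python
def fp16_bytes(n_params: int) -> int:
    """Bytes needed to store n_params weights in fp16 (2 bytes each)."""
    return n_params * 2


def one_bit_bytes(n_params: int) -> int:
    """Bytes needed to store one sign bit per weight (8 weights per byte)."""
    return n_params // 8


N_PARAMS = 7_000_000_000  # ASSUMPTION: nominal size of a "7B" model
NUM_TASKS = 3             # chat, math, code

# Routing over full models: every task-specific model is stored in fp16.
routing_gb = NUM_TASKS * fp16_bytes(N_PARAMS) / 1e9

# One fp16 base model plus a 1-bit task vector per task.
one_bit_gb = (fp16_bytes(N_PARAMS) + NUM_TASKS * one_bit_bytes(N_PARAMS)) / 1e9

print(f"full-precision routing: {routing_gb} GB")
print(f"base + 1-bit vectors:   {one_bit_gb} GB")
```

Under these simplified assumptions each 1-bit task vector costs one sixteenth of an fp16 copy, so adding further tasks grows storage far more slowly than storing another full model.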