On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

Chenghao Fan, Zhenyi Lu, Wei Wei, Jie Tian, Xiaoye Qu, Dangyang Chen, Yu Cheng

TL;DR

This work tackles the high resource cost of adapting large language models by transferring knowledge from multiple task-specific small models to a larger model without updating its parameters. It introduces a dynamic logit fusion mechanism that optimizes per-step fusion weights under KL-divergence constraints, extending naturally from a single expert to multiple domain experts. Across single-task and multi-task benchmarks on the LLaMA/Llama2 family, the method achieves substantial performance gains, closes much of the gap to full fine-tuning of a larger model, and shows strong generalization to unseen tasks. The approach also integrates with in-context learning for single-task settings and with task arithmetic for multi-task scenarios, offering a practical, data-free, and memory-efficient pathway to strong task-specific performance.

Abstract

Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, a substantial memory overhead remains for gradient computations during updates. Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training? In this paper, we explore weak-to-strong specialization using logit arithmetic, which provides a direct answer to this question. Existing weak-to-strong methods often employ a static knowledge transfer ratio and a single small model for transferring complex knowledge, which leads to suboptimal performance. To surmount these limitations, we propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task. This method adaptively allocates weights among these models at each decoding step, learning the weights through Kullback-Leibler divergence constrained optimization problems. We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. By transferring expertise from the 7B model to the 13B model, our method closes the performance gap by 96.4% in single-task scenarios and by 86.3% in multi-task scenarios compared to full fine-tuning of the 13B model. Notably, we achieve superior performance on unseen tasks. Moreover, we demonstrate that our method integrates effortlessly with in-context learning for single tasks and with task arithmetic for multi-task scenarios.
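
To make the idea of per-step logit fusion concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the fused logits take the form of the large model's logits plus a weighted sum of each small expert's offset from the small base model. It also assumes the weights are chosen so that the KL shift of the fused distribution from the large model matches the KL shift of each expert from the small base. The exact constraint, the grid search used as the optimizer, all function names, and the toy logits are illustrative assumptions, not details taken from the abstract.

```python
import itertools

import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    return z - np.log(np.exp(z).sum())


def kl_divergence(logits_p, logits_q):
    """KL(P || Q) between the softmax distributions of two logit vectors."""
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))


def fuse_logits(large, small_base, experts, alphas):
    """Logit arithmetic: large + sum_i alpha_i * (expert_i - small_base)."""
    fused = np.asarray(large, dtype=float).copy()
    base = np.asarray(small_base, dtype=float)
    for alpha, expert in zip(alphas, experts):
        fused += alpha * (np.asarray(expert, dtype=float) - base)
    return fused


def kl_mismatch(large, small_base, experts, alphas):
    """Assumed objective: squared gap between each expert's KL shift from the
    small base and the fused distribution's KL shift from the large model."""
    fused = fuse_logits(large, small_base, experts, alphas)
    fused_shift = kl_divergence(fused, large)
    return sum(
        (kl_divergence(expert, small_base) - fused_shift) ** 2
        for expert in experts
    )


def dynamic_alphas(large, small_base, experts, grid=None):
    """Pick the weights for one decoding step by exhaustive grid search.

    Grid search is an illustrative stand-in for the optimizer; its cost grows
    exponentially with the number of experts.
    """
    if grid is None:
        grid = np.linspace(0.0, 2.0, 21)
    candidates = itertools.product(grid, repeat=len(experts))
    best = min(
        candidates,
        key=lambda a: kl_mismatch(large, small_base, experts, a),
    )
    return tuple(float(a) for a in best)


# Toy 4-token vocabulary; every logit value here is made up for illustration.
large = [2.0, 1.0, 0.5, 0.0]
small_base = [1.0, 1.0, 0.0, 0.0]
math_expert = [0.0, 3.0, 0.0, 0.0]

alphas = dynamic_alphas(large, small_base, [math_expert])
fused = fuse_logits(large, small_base, [math_expert], alphas)
print(alphas, int(np.argmax(fused)))
```

In a real decoder these functions would be called once per generated token, with the three logit vectors coming from forward passes of the large model, the small base model, and each small expert. Fixing `alphas` to a constant instead of calling `dynamic_alphas` recovers the static-ratio setting that the abstract contrasts against.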

Paper Structure (35 sections, 25 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Comparison between our work and previous work. Previous methods only use pre-tuned parameters $\alpha$ to transfer knowledge from a single expert. In contrast, our method dynamically adjusts the proportion of knowledge transferred from multiple experts at each decoding step during inference.
  • Figure 2: Comparison of the pre-defined $\alpha$ with the dynamic $\alpha$ for different tasks.
  • Figure 3: The variation of $\alpha$ in knowledge transfer for the GSM8K expert.
  • Figure 4: The variation of $\alpha$ for the four experts during knowledge transfer on an unseen task (MMLU: abstract algebra).
  • Figure 5: Enhancing in-context learning and task arithmetic with our method.
  • ...and 1 more figures