On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

Chenghao Fan; Zhenyi Lu; Wei Wei; Jie Tian; Xiaoye Qu; Dangyang Chen; Yu Cheng

TL;DR

This work tackles the high resource cost of adapting large language models by transferring knowledge from multiple task-specific small models to a larger model without updating its parameters. It introduces a dynamic logit fusion mechanism that optimizes per-step fusion weights under KL-divergence constraints, extending naturally from a single expert to multiple domain experts. Across single-task and multi-task benchmarks on the LLaMA/Llama2 family, the method achieves substantial performance gains, closes much of the gap to full fine-tuning of a larger model, and shows strong generalization to unseen tasks. The approach also integrates with in-context learning for single-task settings and with task arithmetic for multi-task scenarios, offering a practical, data-free, and memory-efficient pathway to strong task-specific performance.

Abstract

Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, gradient computation during updates still incurs a substantial memory overhead. Can we fine-tune a series of task-specific small models and transfer their knowledge directly to a much larger model without additional training? In this paper, we explore weak-to-strong specialization using logit arithmetic, which provides a direct answer to this question. Existing weak-to-strong methods often employ a static knowledge transfer ratio and a single small model for transferring complex knowledge, which leads to suboptimal performance. To surmount these limitations, we propose a dynamic logit fusion approach that works with a series of task-specific small models, each specialized in a different task. This method adaptively allocates weights among these models at each decoding step, learning the weights by solving Kullback-Leibler divergence constrained optimization problems. We conduct extensive experiments across various benchmarks in both single-task and multi-task settings, achieving leading results. By transferring expertise from the 7B model to the 13B model, our method closes the performance gap by 96.4% in single-task scenarios and by 86.3% in multi-task scenarios compared to full fine-tuning of the 13B model. Notably, we achieve superior performance on unseen tasks. Moreover, we demonstrate that our method can effortlessly integrate in-context learning for single tasks and task arithmetic for multi-task scenarios.
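
The abstract describes logit arithmetic with a fusion weight chosen at each decoding step under a KL-divergence criterion. The following is a minimal single-expert sketch of that idea, not the authors' implementation. It rests on assumptions this page does not confirm: that the fused logits are the large model's logits plus $\alpha$ times the difference between the fine-tuned and base small-model logits, and that $\alpha$ is picked so the KL shift induced in the large model matches the KL shift between the two small models. The function names, the grid search, and the `alpha_max` bound are all hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl(log_p, log_q):
    # KL(P || Q) for two distributions given as log-probabilities.
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def fuse(large, small_base, small_expert, alpha):
    # Assumed logit arithmetic: shift the large model's logits by alpha
    # times the change that fine-tuning induced in the small model.
    shifted = [l + alpha * (e - b)
               for l, e, b in zip(large, small_expert, small_base)]
    return log_softmax(shifted)

def pick_alpha(large, small_base, small_expert, steps=20, alpha_max=2.0):
    # Assumed criterion: match the KL shift of the large model to the
    # KL shift between the fine-tuned and base small models.
    target = kl(log_softmax(small_expert), log_softmax(small_base))
    large_logp = log_softmax(large)

    def gap(alpha):
        fused = fuse(large, small_base, small_expert, alpha)
        return (kl(fused, large_logp) - target) ** 2

    # Hypothetical solver: a coarse grid search over [0, alpha_max].
    grid = [alpha_max * i / steps for i in range(steps + 1)]
    return min(grid, key=gap)

# Toy next-token logits over a three-token vocabulary (made-up numbers).
large = [2.0, 1.0, 0.0]
small_base = [1.0, 0.5, 0.0]
small_expert = [0.0, 0.5, 2.0]
alpha = pick_alpha(large, small_base, small_expert)
next_token_logp = fuse(large, small_base, small_expert, alpha)
```

In a decoding loop, `pick_alpha` would be called once per generated token, which is what makes the weight dynamic rather than a fixed, pre-tuned ratio. Extending the sketch to several experts would mean searching over one weight per expert instead of a single scalar.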

Paper Structure (35 sections, 25 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Efficient Specialization
Weak-to-Strong Generation
Methodology
Problem Background
Autoregressive Language Models
Distance Between Language Model Outputs Distribution
Logit Arithmetic
Adaptive Knowledge Transfer Optimization
Extending to the Fusion of Multiple SLMs
Experiments
Datasets
Implementation Details
Baselines
...and 20 more sections

Figures (6)

Figure 1: Comparison between our work and previous work. Previous methods only use pre-tuned parameters $\alpha$ to transfer knowledge from a single expert. In contrast, our method dynamically adjusts the proportion of knowledge transferred from multiple experts at each decoding step during inference.
Figure 2: Comparison of the pre-defined $\alpha$ with the dynamic $\alpha$ for different tasks.
Figure 3: The variation of $\alpha$ in knowledge transfer for the GSM8K expert.
Figure 4: The variation of $\alpha$ for the four experts during knowledge transfer on an unseen task (MMLU: abstract algebra).
Figure 5: Enhancing in-context learning and task arithmetic using our method.
...and 1 more figures
