Pack of LLMs: Model Fusion at Test-Time via Perplexity Optimization

Costas Mavromatis, Petros Karypis, George Karypis

TL;DR

PackLLM is a modular, training-free method for fusing arbitrary LLMs at test time: it weights each model's logits so that the perplexity of the input prompt is minimized. It has two variants: PackLLM-sim, which sets the weights directly from each model's prompt perplexity, and PackLLM-opt, which approximately solves the perplexity minimization problem with a scalable greedy procedure. A tokenizer alignment step (MinED) makes it possible to fuse models with different vocabularies. Across language modeling and diverse downstream tasks, PackLLM outperforms test-time fusion baselines by 1.89% accuracy points and, when it incorporates recently released LLMs, improves over learning-based fusion methods by 3.92-11.94% accuracy points while using fewer parameters. The results indicate that perplexity is a reliable indicator of LLM expertise for fusion and that new models can be integrated without retraining.

Abstract

Fusing knowledge from multiple Large Language Models (LLMs) can combine their diverse strengths to achieve improved performance on a given task. However, current fusion approaches either rely on learning-based fusers that do not generalize to new LLMs, or do not take into account how well each LLM understands the input. In this work, we study LLM fusion at test-time, which enables leveraging knowledge from arbitrary user-specified LLMs during inference. We introduce Pack of LLMs (PackLLM), an effective method for test-time fusion that leverages each LLM's expertise, given an input prompt. PackLLM performs model fusion by solving an optimization problem for determining each LLM's importance, so that perplexity over the input prompt is minimized. First, our simple PackLLM-sim variant validates that perplexity is a good indicator for measuring each LLM's expertise. Second, our PackLLM-opt variant approximately solves the perplexity minimization problem via a greedy algorithm. The derived importance weights are used to combine the LLMs during inference. We conduct experiments with over 100 total LLMs on a diverse set of tasks. Experimental results show that (i) perplexity is a reliable measure for LLM fusion, (ii) PackLLM outperforms test-time fusion baselines by 1.89% accuracy points, and (iii) PackLLM can leverage new LLMs to improve performance over learning-based fusion approaches by 3.92-11.94% accuracy points.
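
The perplexity-based weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes all models already share one vocabulary (so no tokenizer alignment is needed), that the weights are a softmax over negative log-perplexity with a hypothetical `temperature` parameter, and that fusion is a weighted sum of logits. The function names and the toy logits are made up for illustration, and the greedy PackLLM-opt variant is not covered.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def prompt_perplexity(step_logits, token_ids):
    """Perplexity of a prompt: exp of the mean negative log-likelihood.

    step_logits[t] holds the logits the model used to predict token_ids[t].
    """
    nll = -sum(log_softmax(l)[t] for l, t in zip(step_logits, token_ids))
    return math.exp(nll / len(token_ids))


def sim_weights(perplexities, temperature=1.0):
    """Assumed weighting: softmax over -log(perplexity) / temperature.

    A model with lower prompt perplexity receives a larger weight.
    """
    scores = [-math.log(p) / temperature for p in perplexities]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def fuse_logits(per_model_logits, weights):
    """Weighted sum of the models' logits over a shared vocabulary."""
    vocab_size = len(per_model_logits[0])
    return [
        sum(w * logits[v] for w, logits in zip(weights, per_model_logits))
        for v in range(vocab_size)
    ]


# Toy example: vocabulary of 3 tokens, prompt of 2 tokens.
prompt = [0, 1]
logits_a = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # model A favors the prompt tokens
logits_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # model B is uniform

ppl_a = prompt_perplexity(logits_a, prompt)
ppl_b = prompt_perplexity(logits_b, prompt)
weights = sim_weights([ppl_a, ppl_b])

# Fuse the two models' next-token logits with the prompt-derived weights.
next_token_logits = fuse_logits([[1.0, 0.5, 0.0], [0.0, 0.0, 1.0]], weights)
```

In this toy setup model A has the lower prompt perplexity, so it receives the larger weight and dominates the fused next-token logits. A uniform ensemble would instead give both models a weight of 0.5 regardless of how well each one models the prompt.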

Paper Structure (19 sections, 9 equations, 4 figures, 8 tables, 1 algorithm)

Figures (4)

  • Figure 1: PackLLM performs LLM fusion at test-time. PackLLM (i) does not require any training of fusion models, (ii) can leverage new (more powerful) LLMs when released, and (iii) allows users to select their preferred LLMs at test-time.
  • Figure 2: Overview of LLM fusion paradigms. Left: Learning-based fusion approaches are not modular and cannot encompass new LLMs at test-time. Middle: Uniform ensemble may degrade performance, given weak seed LLMs. Right: Our PackLLM approach determines the weights of the seed LLMs at test-time via perplexity to achieve effective performance.
  • Figure 3: Perplexity result comparison (the lower, the better) on C4 and S2ORC datasets.
  • Figure 4: Analysis of PackLLM$_\text{sim}$ and PackLLM$_\text{opt}$.