Mashup Learning: Faster Finetuning by Remixing Past Checkpoints

Sofia Maria Lo Cicero Vaina, Artem Chumachenko, Max Ryabinin

TL;DR

This paper proposes Mashup Learning, a simple method that reuses the outputs of prior training runs to improve model adaptation to new tasks. The method consistently improves average downstream accuracy over training from scratch.

Abstract

Finetuning on domain-specific data is a well-established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in-house or on open-source platforms. However, these training artifacts are rarely reused for subsequent experiments despite containing improved model abilities for potentially similar tasks. In this paper, we propose Mashup Learning, a simple method to leverage the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the most relevant historical checkpoints for a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across eight standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5-5 percentage points over training from scratch. It also accelerates convergence, requiring 41-46% fewer training steps and up to 37% less total wall-clock time to match from-scratch accuracy, including all selection and merging overhead.
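
The three-stage procedure described in the abstract (select relevant checkpoints, merge them, use the merge as the initialization) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: checkpoints are ranked by mean loss on a small sample of the target data, merging is a uniform parameter average, and the names `select_top_k`, `average_merge`, `mashup_init`, and `toy_loss` are hypothetical. The paper's actual selection criterion and merging method may differ.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a checkpoint is a dict from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]
Sample = Tuple[float, float]


def select_top_k(
    checkpoints: Dict[str, Weights],
    loss_fn: Callable[[Weights, Sample], float],
    samples: Sequence[Sample],
    k: int,
) -> List[str]:
    """Rank checkpoints by mean loss on a small target sample; keep the k best."""
    def mean_loss(weights: Weights) -> float:
        return sum(loss_fn(weights, s) for s in samples) / len(samples)

    ranked = sorted(checkpoints, key=lambda name: mean_loss(checkpoints[name]))
    return ranked[:k]


def average_merge(weight_sets: Sequence[Weights]) -> Weights:
    """Uniformly average each parameter across checkpoints (one possible merge)."""
    n = len(weight_sets)
    merged: Weights = {}
    for name in weight_sets[0]:
        columns = zip(*(w[name] for w in weight_sets))
        merged[name] = [sum(col) / n for col in columns]
    return merged


def mashup_init(
    checkpoints: Dict[str, Weights],
    loss_fn: Callable[[Weights, Sample], float],
    samples: Sequence[Sample],
    k: int,
) -> Weights:
    """Select the k most relevant checkpoints and merge them into one init."""
    chosen = select_top_k(checkpoints, loss_fn, samples, k)
    return average_merge([checkpoints[name] for name in chosen])


# Toy setup: a one-parameter "model" y = w * x, with target data from y = 2x.
def toy_loss(weights: Weights, sample: Sample) -> float:
    x, y = sample
    return (weights["w"][0] * x - y) ** 2


checkpoints = {"a": {"w": [1.0]}, "b": {"w": [2.5]}, "c": {"w": [-3.0]}}
samples = [(1.0, 2.0), (2.0, 4.0)]

# Finetuning on the target task would then start from `init`
# instead of from the base model weights.
init = mashup_init(checkpoints, toy_loss, samples, k=2)
print(init)  # {'w': [1.75]}
```

In this toy run, checkpoints `b` and `a` have the lowest loss on the target sample, so they are averaged, and the unrelated checkpoint `c` is left out of the merge.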

Paper Structure

This paper contains 26 sections, 7 figures, 9 tables, and 1 algorithm.

Figures (7)

  • Figure 1: A schematic example of Mashup Learning.
  • Figure 2: Accuracy on PIQA as a function of the number of samples used for checkpoint selection, evaluated on Mistral-7B-Instruct-v0.2.
  • Figure 3: Mean rank by accuracy of each combination of merging method and number of merged models across 8 benchmarks (ARC-Easy, CommonsenseQA, HellaSwag, MathQA, OpenBookQA, PIQA, SocialIQA, Winogrande) in a leave-one-out setup for Mistral-7B-Instruct-v0.2.
  • Figure 4: Gemma-3 4B (LoRA): sensitivity of training results to the learning rate. Mean accuracy on 8 benchmarks across 3 seeds.
  • Figure 5: Gemma-3 1B: per-task learning rate sensitivity.
  • ...and 2 more figures