How to Merge Your Multimodal Models Over Time?

Sebastian Dziadzio, Vishaal Udandarao, Karsten Roth, Ameya Prabhu, Zeynep Akata, Samuel Albanie, Matthias Bethge

TL;DR

This work tackles the challenge of merging multiple multimodal expert models as new tasks arrive over time, formalizing temporal model merging with the TIME framework that separates initialization, deployment, and merging techniques. Through a large-scale study on the FoMo-in-Flux benchmark, the authors show that time-aware design choices, especially initialization and deployment, drive performance far more than the specific merging rule, and that offline strategies falter without temporal considerations. They introduce Best-in-TIME, an EMA-based initialization/deployment strategy that consistently balances knowledge accumulation and retention and scales favorably with model size and compute, approaching multitask upper bounds in many settings. The findings offer practical guidance for continual multimodal pretraining and establish a systematic baseline for future temporal merging research.

Abstract

Model merging combines multiple expert models - finetuned from a base foundation model on diverse tasks and domains - into a single, more capable model. However, most existing model merging approaches assume that all experts are available simultaneously. In reality, new tasks and domains emerge progressively over time, requiring strategies to integrate the knowledge of expert models as they become available: a process we call temporal model merging. The temporal dimension introduces unique challenges not addressed in prior work, raising new questions such as: when training for a new task, should the expert model start from the merged past experts or from the original base model? Should we merge all models at each time step? Which merging techniques are best suited for temporal merging? Should different strategies be used to initialize the training and deploy the model? To answer these questions, we propose a unified framework called TIME - Temporal Integration of Model Expertise - which defines temporal model merging across three axes: (1) Initialization Phase, (2) Deployment Phase, and (3) Merging Technique. Using TIME, we study temporal model merging across model sizes, compute budgets, and learning horizons on the FoMo-in-Flux benchmark. Our comprehensive suite of experiments across TIME allows us to uncover key insights for temporal model merging, offering a better understanding of current challenges and best practices for effective temporal model merging.

Paper Structure

This paper contains 21 sections, 6 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Temporal Model Merging generalizes standard model merging (yellow), which merges multiple trained experts just once, in a single step. Our systematic study of this realistic multi-step regime reveals that the initialization and deployment strategies matter far more than the choice of single-step weight merging technique.
  • Figure 2: Design Space of Temporal Model Merging through TIME. We showcase our framework for the per-task pipeline of temporal model merging over multiple tasks: At each task $t$, we first initialize the current checkpoint to start training from, $\theta_{t}^{i}$, by using one or more previously stored checkpoints from previous tasks, either directly or by merging them. We train $\theta_{t}^{i}$ on current task data $\mathcal{D}_{t}$ to yield the current task checkpoint $\theta_{t}^{s}$, which is inserted into the checkpoint buffer. Finally, to produce the output model, $\theta_{t}^{o}$, we either merge previously stored checkpoints from the buffer or use them directly. The entire framework is depicted in the pseudo-code on the right panel.
  • Figure 3: Offline merging methods struggle with TIME. All tested merging techniques perform extremely poorly and fail to adapt to the temporal setting, underperforming even a simple replay baseline that sequentially trains the base model on task-replayed data.
  • Figure 4: Improving offline merging. We identify two simple ways to adapt offline merging methods to the temporal setting: (1) replaying data from previous tasks (best-(offline+replay)) and (2) recency-biased weighting of task checkpoints (best-(offline+replay+weighting)). With these improvements, offline merging methods can match the replay baseline.
  • Figure 5: A journey through TIME. We explore various initialization and deployment protocols, finding that using EMA for both initialization and deployment strikes the best balance between knowledge accumulation and zero-shot retention. We refer to this strategy as Best-in-TIME.
  • ...and 8 more figures
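
The per-task pipeline described in the Figure 2 caption (initialize, train, store in the buffer, produce the output model) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Model weights are stood in for by a dict of floats. The names `weighted_average`, `ema_merge`, `toy_train`, `run_time_loop`, and the coefficient `alpha` are hypothetical. The EMA variant shown (initialize from the previous output model, then deploy an average of that model and the newly trained checkpoint) is one plausible reading of an EMA-based initialization/deployment strategy.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]  # assumption: toy stand-in for a model's weights


def weighted_average(checkpoints: List[Params], weights: List[float]) -> Params:
    """Merge checkpoints by a normalized weighted average of each parameter."""
    total = sum(weights)
    return {
        name: sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints)) / total
        for name in checkpoints[0]
    }


def ema_merge(previous: Params, new: Params, alpha: float) -> Params:
    """Exponential moving average: (1 - alpha) * previous + alpha * new."""
    return weighted_average([previous, new], [1.0 - alpha, alpha])


def toy_train(init: Params, data: Params, step: float = 0.5) -> Params:
    """Hypothetical 'training': move each weight part of the way toward a target."""
    return {name: init[name] + step * (data[name] - init[name]) for name in init}


def run_time_loop(
    base: Params,
    tasks: List[Params],
    train: Callable[[Params, Params], Params],
    alpha: float = 0.5,
) -> Tuple[List[Params], List[Params]]:
    """Run the per-task loop: initialize, train, store, deploy."""
    buffer: List[Params] = []   # checkpoint buffer of trained task models
    outputs: List[Params] = []  # deployed model after each task (theta_t^o)
    merged = dict(base)         # running merged model, starts at the base weights

    for data in tasks:
        init = merged                               # initialization: theta_t^i
        trained = train(init, data)                 # training on D_t: theta_t^s
        buffer.append(trained)                      # insert into the buffer
        merged = ema_merge(merged, trained, alpha)  # deployment: theta_t^o
        outputs.append(merged)

    return outputs, buffer
```

Calling `run_time_loop({"w": 0.0}, [{"w": 4.0}, {"w": 8.0}], toy_train)` returns the deployed model after each task together with the stored checkpoints. Swapping `ema_merge` for a different rule, or initializing from `base` instead of `merged`, changes one axis of the design space while leaving the others untouched, which is the separation the framework is built around.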