A Practitioner's Guide to Continual Multimodal Pretraining

Karsten Roth, Vishaal Udandarao, Sebastian Dziadzio, Ameya Prabhu, Mehdi Cherti, Oriol Vinyals, Olivier Hénaff, Samuel Albanie, Matthias Bethge, Zeynep Akata

TL;DR

This work addresses the challenge of keeping multimodal foundation models up to date under realistic deployment scenarios built around minor updates. It introduces FoMo-in-Flux, a large, controllable benchmark built from 63 datasets, together with Memory-Adjusted-FLOPs (MAFs) as a compute measure, to study long-horizon continual multimodal pretraining. It then provides a comprehensive analysis centered on data, methods, and training recipes. Key findings show that model merging offers the most favorable accumulation-retention trade-offs across update horizons, learning-rate meta-schedules bolster both retention and knowledge gain, larger models aid long-term adaptation, and replaying buffered data is crucial for stable continual updates. The results yield practical guidelines for real-world deployment, including when to use major versus minor updates, how to schedule learning rates across tasks, and how to allocate compute and data across adaptation, pretraining, and buffering to minimize forgetting while maximizing knowledge gain.

Abstract

Multimodal foundation models serve numerous applications at the intersection of vision and language. Still, despite being pretrained on extensive data, they become outdated over time. To keep models updated, research into continual pretraining mainly explores scenarios with either (1) infrequent, indiscriminate updates on large-scale new data, or (2) frequent, sample-level updates. However, practical model deployment often operates in the gap between these two limit cases, as real-world applications frequently demand adaptation to specific subdomains, tasks, or concepts, spread over the entire, varying life cycle of a model. In this work, we complement current perspectives on continual pretraining with a research test bed and provide comprehensive guidance for effective continual model updates in such scenarios. We first introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements, constructed over 63 datasets with diverse visual and semantic coverage. Using FoMo-in-Flux, we explore the complex landscape of practical continual pretraining through multiple perspectives: (1) a data-centric investigation of data mixtures and stream orderings that emulate real-world deployment situations, (2) a method-centric investigation ranging from simple fine-tuning and traditional continual learning strategies to parameter-efficient updates and model merging, (3) meta learning rate schedules and mechanistic design choices, and (4) the influence of model and compute scaling. Together, our insights provide a practitioner's guide to continual multimodal pretraining for real-world deployment. Our benchmark and code are available at https://github.com/ExplainableML/fomo_in_flux.

Paper Structure

This paper contains 65 sections, 8 equations, 20 figures, and 5 tables.

Figures (20)

  • Figure 1: FoMo-in-Flux pipeline. (Pretraining) We start from pretrained CLIP $\theta_{0}$ and its pretraining pool $\mathcal{P}$. (Update steps) At each step $t$, we sample training instances $\mathcal{S}_{t}$ from $\mathcal{P}$, the current update pool $\mathcal{D}_{t}$, and the memory buffer $\mathcal{B}$ (containing all past update pools), and train for a fixed compute budget ($F$ MAFs).
  • Figure 2: Visualisation of generated captions. We showcase sample captions generated using our two-stage pipeline for fine-grained classes (birds from Birdsnap [birdsnap]) and general, coarse classes (taken from SUN397 [sun397]). The generated captions combine image descriptions with important semantic class information.
  • Figure 3: Visualisation of programmatically generated captions for Shapes3D [shapes3d] (right) and DSprites [dsprites] (left, black and white). Caption detail is chosen at random: some captions give exact details, while others use only more generic descriptors. Caption style leverages templates generated by GPT-4. The default resolution of these images is $64\times 64$, hence the low-resolution appearance.
  • Figure 4: Examples of our generated obscure things and animals along with captions, covering $100$ rare things and animals. For each class, images are generated using either Kandinsky-2.1 [razzhigaev2023kandinsky], Stable Diffusion 2.1 [stablediffusion], or Dreamlike-PhotoReal [dreamlike_photoreal].
  • Figure 5: Pictographic visualization of different data stream orderings included within the FoMo-in-Flux benchmark setup.
  • ...and 15 more figures
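
The update procedure described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the benchmark's implementation: the function names, the `ratios` triple (mixing weights for $\mathcal{P}$, $\mathcal{D}_{t}$, and $\mathcal{B}$), and the `train_step` callback are all hypothetical, and the fixed compute budget of $F$ MAFs is approximated here by a fixed number of training steps per update.

```python
import random


def sample_update_batch(pretrain_pool, update_pool, buffer, ratios, batch_size, rng):
    """Draw one mixed batch S_t from P, D_t and B.

    `ratios` is a hypothetical (w_P, w_D, w_B) triple of mixing weights;
    it is an assumption of this sketch, not a quantity named in the caption.
    """
    sources = [pretrain_pool, update_pool, buffer]
    # An empty source (e.g. the buffer during the first update) gets weight 0.
    weights = [w if src else 0.0 for w, src in zip(ratios, sources)]
    if sum(weights) <= 0:
        raise ValueError("no non-empty data source with positive weight")
    batch = []
    for _ in range(batch_size):
        src = rng.choices(sources, weights=weights, k=1)[0]
        batch.append(rng.choice(src))
    return batch


def continual_update(theta, pretrain_pool, update_pools, ratios,
                     steps_per_update, batch_size, train_step, seed=0):
    """Run the update loop over a stream of update pools D_1, ..., D_T.

    `steps_per_update` stands in for the fixed compute budget (F MAFs);
    `train_step(theta, batch)` is a user-supplied, hypothetical callback
    that returns the updated model state.
    """
    rng = random.Random(seed)
    buffer = []  # B: accumulates all past update pools
    for update_pool in update_pools:
        for _ in range(steps_per_update):
            batch = sample_update_batch(
                pretrain_pool, update_pool, buffer, ratios, batch_size, rng
            )
            theta = train_step(theta, batch)
        # After finishing update t, D_t becomes part of the buffer.
        buffer.extend(update_pool)
    return theta, buffer
```

Because the buffer is only extended after an update finishes, batches for update $t$ draw replay samples from earlier pools only, while fresh samples come from $\mathcal{D}_{t}$ itself.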