TiC-CLIP: Continual Training of CLIP Models

Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri

TL;DR

The paper introduces the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models (TiC-DataComp, TiC-YFCC, and TiC-Redcaps) and demonstrates that a simple rehearsal-based approach reduces compute by $2.5\times$ compared to the standard practice of retraining from scratch.

Abstract

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch. Code is available at https://github.com/apple/ml-tic-clip.
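
The rehearsal-based approach described in the abstract (continue from the last checkpoint, replay old data) can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not the authors' implementation, which lives in the repository linked above. The names `build_training_pool`, `continual_train`, `toy_init`, and `toy_train` are hypothetical, the "model" is a plain dictionary rather than a CLIP network, and uniform subsampling of old data is just one possible buffer policy.

```python
import random


def build_training_pool(new_data, old_data, buffer_size=None, seed=0):
    """Combine the newest time step with replayed old data.

    buffer_size=None replays all old data; otherwise old data is
    uniformly subsampled to buffer_size examples (an assumed policy).
    """
    old = list(old_data)
    if buffer_size is not None and len(old) > buffer_size:
        old = random.Random(seed).sample(old, buffer_size)
    return old + list(new_data)


def continual_train(time_steps, init_fn, train_fn, budget,
                    warm_start=True, buffer_size=None):
    """Train once per time step with a fixed per-step compute budget.

    warm_start=True continues from the previous checkpoint;
    warm_start=False re-initializes the model at every step.
    """
    model = init_fn()
    seen = []          # all data from earlier time steps
    checkpoints = []
    for new_data in time_steps:
        pool = build_training_pool(new_data, seen, buffer_size)
        if not warm_start:
            model = init_fn()
        model = train_fn(model, pool, budget)
        checkpoints.append(model)
        seen.extend(new_data)
    return checkpoints


# Hypothetical stand-ins for a real model and training routine.
def toy_init():
    return {"iterations": 0, "years_seen": set()}


def toy_train(model, pool, budget):
    rng = random.Random(model["iterations"])
    updated = {"iterations": model["iterations"] + budget,
               "years_seen": set(model["years_seen"])}
    for _ in range(budget):
        year, _pair = rng.choice(pool)  # one "iteration" visits one example
        updated["years_seen"].add(year)
    return updated


# Three time steps of (year, image-text pair id) examples.
steps = [[(2014, i) for i in range(5)],
         [(2015, i) for i in range(5)],
         [(2016, i) for i in range(5)]]

warm = continual_train(steps, toy_init, toy_train, budget=50)
cold = continual_train(steps, toy_init, toy_train, budget=50, warm_start=False)
```

With warm start, the final checkpoint has accumulated the iterations of every step, while the cold-start variant only reflects the last step's budget. The `buffer_size` argument controls how much old data is replayed alongside the new data.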

Paper Structure

This paper contains 38 sections, 1 equation, 22 figures, and 20 tables.

Figures (22)

  • Figure 1: (Left, Middle) OpenAI models show less zero-shot robustness on the retrieval task from 2021-2022. OpenCLIP models and OpenAI models have similar robustness on standard benchmarks. However, OpenAI models show less robustness on our retrieval task when compared with recent models in the OpenCLIP repository, highlighting susceptibility to a time-evolving data distribution. (Right) A simple continual training baseline is computationally efficient and competitive with retraining from scratch. Different points denote models trained sequentially on our TiC-DataComp (L) as data arrives over time. Warm-start training from the previous checkpoint with replay of all old data performs similarly to the Oracle, which trains from scratch every time new data arrives, while using $2.7\times$ less compute.
  • Figure 1: Table summarizing our methods. $D$: data size in each step, $T$: total time steps, $t$: current time step, $C$: compute budget (iterations).
  • Figure 2: Experimental protocol on our proposed continual benchmarks. (A) Combine new and old data given buffer constraints. (B) Continually train a model with a compute budget (say $C$), starting either from the previous checkpoint or from scratch. (C) Evaluate models on standard datasets and our proposed dynamic datasets. Comparison with other benchmarks in \ref{app:cl_benchmarks}.
  • Figure 3: Distribution of examples changes from 2014 to 2022 in our dynamic evaluation tasks. (Left) Samples for text-to-image retrieval. For new timestamps, images from novel concepts appear (e.g., COVID-19). (Right) Samples from our classification task for 4 categories. We observe that not only do objects evolve over time, but images from recent timestamps are also captured more in the wild.
  • Figure 3: ImageNet continual training. Cumulative-All remains close to Oracle.
  • ...and 17 more figures