Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Divyansh Pareek; Sewoong Oh; Simon S. Du

Abstract

The success of modern multimodal representation learning relies on internet-scale datasets. Because a large fraction of raw web data is of low quality, data curation has become a critical step in the training pipeline. Filtering with a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Parameterizing the data by the fraction of pairs with correctly matched modalities and by the number of paired samples, we use a linear contrastive learning setup to show a provable benefit of data filtering: (i) the error without filtering admits matching upper and lower bounds, and (ii) the error with teacher-based filtering admits one upper bound in the regime where the matched fraction is large and a separate upper bound in the regime where it is small.
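As a rough illustration of what teacher-based filtering means operationally, the following Python sketch scores each pair with a linear "teacher" and keeps only high-scoring pairs. Everything in it is an assumption made for illustration and is not taken from the paper: the toy data generator `make_pairs`, the noise level, the near-identity teacher matrix `W`, the cosine-similarity score, and the threshold value are all hypothetical choices.

```python
import numpy as np

def make_pairs(n, d, eta, rng):
    """Toy paired data (hypothetical): roughly a fraction eta of pairs share a latent vector."""
    z = rng.standard_normal((n, d))
    x = z + 0.1 * rng.standard_normal((n, d))   # first modality
    y = z + 0.1 * rng.standard_normal((n, d))   # second modality, matched to x
    clean = rng.random(n) < eta                 # True where the pair stays matched
    n_bad = int((~clean).sum())
    y[~clean] = rng.standard_normal((n_bad, d)) # mismatched pairs: y independent of x
    return x, y, clean

def teacher_scores(x, y, W):
    """Quality score per pair: cosine similarity of the linear teacher embeddings."""
    u = x @ W.T
    v = y @ W.T
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(u * v, axis=1)

def filter_pairs(x, y, scores, threshold):
    """Keep only the pairs whose teacher score reaches the threshold."""
    keep = scores >= threshold
    return x[keep], y[keep], keep

rng = np.random.default_rng(0)
n, d, eta = 2000, 32, 0.3
x, y, clean = make_pairs(n, d, eta, rng)

# Hypothetical teacher: a slightly perturbed identity map.
W = np.eye(d) + 0.01 * rng.standard_normal((d, d))

scores = teacher_scores(x, y, W)
x_f, y_f, keep = filter_pairs(x, y, scores, threshold=0.8)

print("matched fraction before filtering:", clean.mean())
print("matched fraction after filtering: ", clean[keep].mean())
print("pairs kept:", int(keep.sum()), "of", n)
```

In this toy setting the retained subset is dominated by correctly matched pairs, which is the effect the filtering step is meant to achieve; a student model would then be trained with a contrastive objective on `x_f` and `y_f` only.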