Data Processing Techniques for Modern Multimodal Models
Yinheng Li, Han Ding, Hang Chen
TL;DR
The paper tackles how data processing shapes the training of modern multimodal models, with a focus on diffusion-based image generation and multimodal large language models (MLLMs). It proposes a pragmatic framework that classifies techniques into data quality, distribution, and safety, while treating data quantity as contingent on data sources; data quantity is not exhaustively covered. Through surveys of filtering, augmentation, balancing, and safety methods, it highlights that diffusion models prioritize image quality on large-scale data with safety filters, whereas MLLMs emphasize text quality and image-text alignment using curated data, aided by model-based filtering. The work underscores an iterative data-processing pipeline where model feedback guides data curation, and it argues for tailoring techniques to task and architecture, with humans still playing a crucial role for high-quality finetuning. Overall, it provides practical guidance for practitioners to design robust, fair, and effective multimodal training pipelines.
Abstract
Data processing plays a significant role in current multimodal model training. In this paper, we provide a comprehensive review of common data processing techniques used in modern multimodal model training, with a focus on diffusion models and multimodal large language models (MLLMs). We summarize the techniques into four categories: data quality, data quantity, data distribution, and data safety. We further present our findings on the choice of data processing methods for different types of models. This study aims to give developers of multimodal models guidance on effective data processing techniques.
