Dataset Growth
Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You
TL;DR
InfoGrowth addresses the infeasibility of manually cleaning web-scale data with an online data-growth framework that jointly cleans incoming samples and selects the informative ones. It combines a cleaner built on a multimodal encoder, a gain calculator that relies on online nearest-neighbor search, and a sampling strategy that maintains cleanliness and diversity as data stream in. The reported data-efficiency improvement is $2\times$ to $4\times$ on both multimodal and single-modal tasks. Effectiveness is demonstrated on CC3M and ImageNet-1K across vision-language and vision-only benchmarks, ablations confirm the value of both cleaning and gain-based sampling, and the scalability discussion suggests that the method can be extended to billion-scale data via distributed ANN. Overall, InfoGrowth offers a practical, scalable way to maintain high-quality datasets as data growth accelerates, with implications for more efficient and sustainable AI deployment.
Abstract
Deep learning benefits from the growing abundance of available data, but handling data at this scale efficiently has become a challenge. Publicly available data come from different sources and vary in quality, and manual cleaning against noise and redundancy is impractical at today's data scale. Existing techniques for cleaning or selecting collected data are mainly designed for offline settings and target only one of the two problems, cleanliness or redundancy. In practice, data grow exponentially and suffer from both problems, so curation has to be repeated with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection that produces a growing dataset which stays up to date while accounting for cleanliness and diversity. InfoGrowth improves data quality and efficiency on both single-modal and multi-modal tasks, and its efficient, scalable design makes it practical for real-world data engines.
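To make the idea of online, gain-based selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes each incoming sample has already been encoded into a non-zero embedding vector, defines gain as one minus the cosine similarity to the nearest kept samples, and keeps a sample when that gain exceeds a fixed threshold. The function name `grow_dataset`, the parameters `k` and `gain_threshold`, the gain definition, and the hard threshold are all illustrative assumptions; the cleaning step is omitted, and a brute-force search stands in for the approximate nearest-neighbor index a large-scale system would use.

```python
import numpy as np

def normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def grow_dataset(stream, k=1, gain_threshold=0.1):
    """Hypothetical online selection loop (illustrative, not the paper's algorithm).

    stream: iterable of 1-D embedding vectors (assumed encoder outputs).
    A sample's gain is 1 minus the mean cosine similarity to its k most
    similar kept samples; samples with gain below the threshold are
    treated as redundant and skipped.
    """
    kept = []
    for emb in stream:
        emb = normalize(np.asarray(emb, dtype=float))
        if not kept:
            kept.append(emb)  # the first sample is always informative
            continue
        sims = np.stack(kept) @ emb      # brute-force similarity to kept set
        top = np.sort(sims)[-k:]         # the k highest similarities
        gain = 1.0 - float(top.mean())
        if gain >= gain_threshold:
            kept.append(emb)
    return np.stack(kept) if kept else np.empty((0, 0))
```

For example, feeding the stream `[[1, 0], [1, 0], [0, 1], [0.999, 0.01]]` keeps only the first and third vectors: the second is an exact duplicate and the fourth is nearly parallel to the first, so both have negligible gain. Replacing the hard threshold with gain-weighted sampling, or the brute-force search with an ANN index, would be natural variations on this sketch.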
