Dataset Growth

Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You

TL;DR

Manually cleaning web-scale data is infeasible, so InfoGrowth introduces an online data-growth framework that jointly cleans incoming data and selects informative samples. It combines a multimodal encoder-based cleaner, a gain calculator built on online nearest-neighbor search, and a sampling strategy to maintain cleanliness and diversity as data streams in, achieving substantial data-efficiency improvements ($2\sim4\times$) on both multimodal and single-modal tasks. The method is evaluated on CC3M and ImageNet-1K across vision-language and vision-only benchmarks; ablations confirm the value of cleaning and gain-based sampling, and the scalability analysis suggests the approach can extend to billion-scale data via distributed ANN. Overall, InfoGrowth offers a practical, scalable way to maintain high-quality datasets as data growth accelerates, with implications for more efficient and sustainable AI deployment.

Abstract

Deep learning benefits from the growing abundance of available data. Meanwhile, efficiently dealing with the growing data scale has become a challenge. Publicly available data come from different sources and vary in quality, and manually cleaning them of noise and redundancy is impractical at today's data scale. Techniques exist for cleaning or selecting collected data. However, these methods are mainly proposed for offline settings and target only one of the two problems, cleanness or redundancy. In practice, data are growing exponentially and exhibit both problems. This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that stays up to date while remaining aware of cleanliness and diversity. InfoGrowth can improve data quality/efficiency on both single-modal and multi-modal tasks, with an efficient and scalable design. Its framework makes it practical for real-world data engines.
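
The online clean-then-select loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class `GrowingDataset`, the alignment threshold, the gain definition (cosine distance to the nearest retained sample), and the keep-with-probability-equal-to-gain rule are all assumptions made for illustration, and a brute-force scan stands in for the nearest-neighbor index.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class GrowingDataset:
    """Toy cleaner -> gain calculator -> sampler loop (hypothetical names)."""

    def __init__(self, align_threshold=0.3, seed=0):
        self.align_threshold = align_threshold  # assumed noise cutoff
        self.kept = []                          # embeddings of retained samples
        self.rng = random.Random(seed)

    def is_clean(self, image_emb, text_emb):
        # Assumed cleaner: the image and text embeddings must agree.
        return cosine(image_emb, text_emb) >= self.align_threshold

    def gain(self, emb):
        # Assumed gain: cosine distance to the nearest retained sample.
        # Brute force here; a large-scale system would query an ANN index.
        if not self.kept:
            return 1.0
        nearest = max(cosine(emb, k) for k in self.kept)
        return min(1.0, max(0.0, 1.0 - nearest))

    def offer(self, image_emb, text_emb):
        """Process one streaming sample; return True if it was retained."""
        if not self.is_clean(image_emb, text_emb):
            return False
        # Assumed sampler: keep the sample with probability equal to its gain.
        if self.rng.random() < self.gain(image_emb):
            self.kept.append(image_emb)
            return True
        return False


ds = GrowingDataset()
results = [
    ds.offer([1.0, 0.0], [1.0, 0.0]),  # first clean sample: gain 1, kept
    ds.offer([1.0, 0.0], [1.0, 0.0]),  # exact duplicate: gain 0, skipped
    ds.offer([0.0, 1.0], [0.0, 1.0]),  # novel direction: gain 1, kept
    ds.offer([1.0, 0.0], [0.0, 1.0]),  # image/text mismatch: rejected as noisy
]
print(results)  # [True, False, True, False]
```

The toy inputs sit at the extremes (gain exactly 0 or 1), so the outcome does not depend on the random seed. Intermediate gains would be kept stochastically.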

Paper Structure (23 sections, 3 equations, 6 figures, 8 tables, 1 algorithm)

Figures (6)

  • Figure 1: Data grows rapidly on an exponential scale. A dataset with cleanness and diversity leads to better data efficiency and training results.
  • Figure 2: Pipeline of InfoGrowth. Streaming data first goes through the cleaner, then the gain calculator, and finally the selector.
  • Figure 3: InfoGrowth yields better training results than the original CC3M and the MiniGPT4-recaptioned CC3M with the same amount of data. Its cost is also much lower than recaptioning all data.
  • Figure 4: Gain decays as more data is collected.
  • Figure 5: Redundant samples detected in CC3M and their cosine similarity.
  • ...and 1 more figure