
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao

TL;DR

Alchemist tackles the data-quality bottleneck in text-to-image training with a meta-gradient framework that learns per-sample data weights through a lightweight rater trained with a bilevel optimization signal. It couples this rating with Shift-Gsample data pruning, which retains informative samples from the middle-to-late portion of the score ranking, improving data efficiency and shortening training without sacrificing quality. Across multiple model families and data domains, Alchemist consistently outperforms random data selection, with an Alchemist-selected 50% subset matching or exceeding full-data results while training substantially faster. The approach combines gradient-aligned sample influence, multi-granularity perception, and pruning that balances convergence speed against final performance, enabling scalable data curation for T2I models.

Abstract

Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
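
To make the data rating stage more concrete, the toy sketch below shows one common way a meta-gradient can score training samples. It is not the paper's implementation: the T2I proxy model is replaced by a one-parameter linear regressor, the learned rater by free per-sample logits, and the bilevel problem by a one-step lookahead approximation. The function names, data, and hyperparameters are all hypothetical.

```python
import numpy as np

def per_sample_grads(theta, x, y):
    # Gradient of 0.5 * (theta * x - y)^2 w.r.t. theta, one value per sample.
    return (theta * x - y) * x

def meta_gradient_scores(x_tr, y_tr, x_val, y_val, steps=200, lr=0.1, meta_lr=1.0):
    theta = 0.0                              # toy stand-in for the proxy model
    logits = np.zeros_like(x_tr)             # toy stand-in for the rater's outputs
    for _ in range(steps):
        w = np.exp(logits - logits.max())
        w /= w.sum()                         # softmax weights over training samples
        g_tr = per_sample_grads(theta, x_tr, y_tr)
        # Inner step: update the model on the weighted training loss.
        theta_new = theta - lr * np.sum(w * g_tr)
        # Outer signal: gradient of the mean validation loss at the updated model.
        g_val = per_sample_grads(theta_new, x_val, y_val).mean()
        # Meta-gradient: d L_val(theta_new) / d w_i = -lr * g_val * g_tr[i].
        grad_w = -lr * g_val * g_tr
        # Chain rule through the softmax.
        grad_logits = w * (grad_w - np.sum(w * grad_w))
        logits -= meta_lr * grad_logits
        theta = theta_new
    return theta, logits

# Toy data: the true slope is 2; every third training label has its sign flipped.
x_tr = np.linspace(0.5, 1.5, 12)
corrupt = np.arange(12) % 3 == 2
y_tr = np.where(corrupt, -2.0 * x_tr, 2.0 * x_tr)
x_val = np.array([0.8, 1.0, 1.2])
y_val = 2.0 * x_val

theta, logits = meta_gradient_scores(x_tr, y_tr, x_val, y_val)
print("mean logit (clean):  ", logits[~corrupt].mean())
print("mean logit (corrupt):", logits[corrupt].mean())
```

In this toy setting, samples whose gradients align with the validation gradient are pushed toward higher logits, which mirrors the gradient-based notion of sample influence described in the abstract.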

Paper Structure

This paper contains 11 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overall pipeline of Alchemist. In the initial data rating stage (a), the rater predicts a classification score for each image based on gradients extracted from a T2I proxy model. The rater and the proxy model are jointly optimized through a weighted loss and a total loss. In the data pruning stage (b), we introduce the Shift-Gsample strategy to efficiently retain informative samples while filtering out redundant data and outliers. The resulting Alchemist-selected dataset enables highly efficient training of downstream text-to-image models.
  • Figure 2: Loss and gradient norm across different rating score ranges. For each training sample, we record its instantaneous loss and gradient norm at each training step during STAR-0.3B training. We track the evolution of loss and gradient norm over epochs.
  • Figure 3: Representative examples of data distribution across score regions. The head region mainly contains plain samples, the middle-to-late region contains informative samples, and the tail region contains chaotic samples. Alchemist-selected data aligns with human intuition, filtering out most plain and chaotic samples.
  • Figure 4: Image distribution of Alchemist-selected LAION data subsets. Samples are sorted from high to low scores. The early portion mainly contains images with white or plain backgrounds, the middle portion is more informative and content-rich, and the tail portion gradually becomes noisier, containing unclear content, multiple objects, or visually dense regions.
  • Figure 5: Performance of models trained on 6M and 15M data versus training time. Evaluations are conducted on the MJHQ-30K benchmark.
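
The score regions described in Figures 3 and 4 (plain head, informative middle-to-late portion, noisy tail) suggest a pruning rule that samples mostly from the middle-to-late ranks. The sketch below is one possible reading of such a rule, assuming a Gaussian sampling distribution over rank percentile whose center is shifted away from the top-ranked samples. The exact Shift-Gsample formulation is not given in this summary, so the function name and the `center` and `width` parameters are hypothetical.

```python
import numpy as np

def shifted_gaussian_select(scores, keep_ratio=0.5, center=0.6, width=0.2, seed=0):
    """Return sorted indices of a subset drawn mostly from middle-to-late ranks.

    `center` and `width` are illustrative values on the rank-percentile axis,
    where 0 is the highest-scored sample and 1 is the lowest.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    order = np.argsort(-scores)                 # rank 0 = highest score
    pct = (np.arange(n) + 0.5) / n              # rank percentile in (0, 1)
    # Gaussian sampling weight centered on the assumed informative region.
    p = np.exp(-0.5 * ((pct - center) / width) ** 2)
    p /= p.sum()
    rng = np.random.default_rng(seed)
    k = int(round(keep_ratio * n))
    picked_ranks = rng.choice(n, size=k, replace=False, p=p)
    return np.sort(order[picked_ranks])

# Usage: keep half of 1000 samples whose scores decrease with index.
scores = np.linspace(1.0, 0.0, 1000)
selected = shifted_gaussian_select(scores, keep_ratio=0.5)
print(len(selected), "samples kept")
```

Sampling rather than cutting a hard window keeps a small share of head and tail samples, which is a design choice of this sketch and not a claim about the paper's method.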