InstructEngine: Instruction-driven Text-to-Image Alignment

Xingyu Lu, Yuhang Hu, YiFan Zhang, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Jinpeng Wang, Chun Yuan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

TL;DR

The paper tackles data and algorithmic bottlenecks in RLHF/RLAIF-based text-to-image alignment by introducing InstructEngine, an instruction-driven framework that combines a text-to-image taxonomy, automated preference data construction, and a cross-validation alignment training scheme. It builds 25K preference-pair samples via a taxonomy-guided pipeline and refines training with cross-validated triples to improve data efficiency, achieving notable gains on SD v1.5 and SDXL in DrawBench and surpassing baselines in human evaluations. The results demonstrate that instruction-centric alignment can achieve strong, data-efficient performance with interpretable preferences, reducing reliance on costly manual annotation and biased reward models. Overall, InstructEngine offers a practical pathway toward scalable, human-aligned text-to-image generation with improved aesthetic and semantic coherence.

Abstract

Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face limitations in both data and algorithms. For training data, most approaches rely on manually annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, and the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and take only image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce a cross-validation alignment method, which improves data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves the performance of SD v1.5 and SDXL by 10.53% and 5.30%, respectively, outperforming state-of-the-art baselines, with an ablation study confirming the benefits of all of InstructEngine's components. A win rate of over 50% in human reviews also indicates that InstructEngine better aligns with human preferences.
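
As a rough illustration of what "organizing semantically analogous samples into mutually comparable pairs" could look like, the sketch below takes a group of analogous (instruction, image) samples and emits every ordered comparison inside the group. This is an assumption made for illustration, not the paper's actual procedure: the names `Sample`, `Triple`, and `cross_pairs` are hypothetical, and the rule that each instruction prefers its own image over a sibling's image is a guess at the intent.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Sample:
    instruction: str  # text prompt carrying preference information
    image_id: str     # identifier of the image generated from that prompt


@dataclass(frozen=True)
class Triple:
    instruction: str     # the conditioning text
    chosen_image: str    # image generated from this instruction
    rejected_image: str  # image generated from an analogous instruction


def cross_pairs(group: list[Sample]) -> list[Triple]:
    """Build every ordered comparison inside one group of analogous samples.

    Assumption: for a given instruction, its own image is preferred over the
    image of any other sample in the same group.
    """
    return [
        Triple(a.instruction, a.image_id, b.image_id)
        for a, b in permutations(group, 2)
    ]


if __name__ == "__main__":
    group = [
        Sample("a photo of a cat, sharp focus", "img_0"),
        Sample("a photo of a cat, blurry", "img_1"),
    ]
    for triple in cross_pairs(group):
        print(triple)
```

Under this assumption, a group of n analogous samples yields n(n-1) comparisons instead of a single preferred/rejected pair, which is one way the same data could supply more training signal.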

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Differences in preference modeling: Previous alignment methods convey preferences through preferred/disliked images or a reward model, whereas the InstructEngine framework encodes fine-grained preference information in three dimensions through text, making the injected preferences understandable by humans.
  • Figure 2: We propose InstructEngine, a text-to-image alignment framework injecting preference information through contrasting instructions. After training with our preference data and alignment method, the SDXL model generates images that are more realistic and align better with human aesthetic preferences. We present generation results across human, animal, artwork, food, and landscape.
  • Figure 3: Visualization of themes in InstructEngine's taxonomy, divided into five categories for easier demonstration.
  • Figure 4: Construction pipeline of InstructEngine's preference data: We build a taxonomy for text-to-image instructions with Text Sampling, Theme Expansion and Subtopic Division. Based on the entities in the taxonomy, we first inject three kinds of preference information to generate fine-grained preference instructions, then generate consistent images for text-to-image alignment.
  • Figure 5: Comparison of data efficiency across three datasets. The gray dashed line represents the performance of the original model.
  • ...and 2 more figures