GreenStableYolo: Optimizing Inference Time and Image Quality of Text-to-Image Generation
Jingzhi Gong, Sisi Li, Giordano d'Aloisio, Zishuo Ding, Yulong Ye, William B. Langdon, Federica Sarro
TL;DR
GreenStableYolo tackles the trade-off between inference time and image quality in text-to-image generation by casting the tuning of prompts and parameters as a multi-objective optimization problem solved with NSGA-II. It measures inference time as GPU time and image quality via a YOLO-based object-matching metric, and it searches for a Pareto front over parameters such as the number of inference steps, the guidance scale, and the prompts. The approach reports substantial latency reductions and improved hypervolume relative to StableYolo, with a modest drop in image quality, and analyzes parameter importance to inform practical tuning. The work demonstrates the practical potential of NSGA-II-driven optimization for efficiency-aware GenAI deployment and suggests broader applicability to other diffusion-based systems and energy-focused metrics.
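To make the Pareto-front idea concrete, the following minimal sketch filters a set of candidate configurations down to the non-dominated ones, assuming each candidate is summarized as a pair of inference time (to minimize) and quality score (to maximize). The names `dominates` and `pareto_front` and the sample values are illustrative assumptions, not taken from the GreenStableYolo implementation, and the sketch shows only the dominance test rather than NSGA-II itself.

```python
from typing import List, Tuple

# Hypothetical summary of one configuration: (inference_time_s, quality_score).
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and better on one.

    Inference time is minimized; quality is maximized.
    """
    time_a, quality_a = a
    time_b, quality_b = b
    no_worse = time_a <= time_b and quality_a >= quality_b
    strictly_better = time_a < time_b or quality_a > quality_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates (input order kept)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up example values, for illustration only.
samples = [(2.0, 0.9), (1.0, 0.7), (3.0, 0.8), (1.5, 0.6)]
front = pareto_front(samples)
```

Here `(3.0, 0.8)` is dropped because `(2.0, 0.9)` is both faster and of higher quality, and `(1.5, 0.6)` is dropped because `(1.0, 0.7)` is. The two remaining points represent genuinely different trade-offs.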
Abstract
Tuning the parameters and prompts of AI-based text-to-image generation remains a substantial yet unaddressed challenge. We therefore introduce GreenStableYolo, which uses NSGA-II and Yolo to tune the parameters and prompts of Stable Diffusion so as to both reduce GPU inference time and increase image generation quality. Our experiments show that, despite a relatively slight trade-off (18%) in image quality compared to StableYolo (which considers only image quality), GreenStableYolo achieves a substantial reduction in inference time (266% less) and a 526% higher hypervolume, thereby advancing the state of the art for text-to-image generation.
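Hypervolume, the indicator cited above, measures how much of the objective space a set of solutions dominates relative to a fixed reference point, and larger is better. The sketch below computes it for the two-objective case, assuming inference time is minimized and quality is maximized. The function name `hypervolume_2d`, the reference point, and the sample values are illustrative assumptions and do not reproduce the paper's measurement setup.

```python
from typing import Iterable, Tuple

# Hypothetical summary of one solution: (inference_time_s, quality_score).
Point = Tuple[float, float]


def hypervolume_2d(points: Iterable[Point], reference: Point) -> float:
    """Area dominated by `points` and bounded by `reference`.

    `reference` is a deliberately poor (time, quality) pair: a time upper
    bound and a quality lower bound. Points that do not improve on the
    reference in both objectives contribute nothing.
    """
    ref_time, ref_quality = reference
    # Sort by time ascending; for equal times, best quality first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))

    area = 0.0
    best_quality = ref_quality  # highest quality covered so far
    for time, quality in ordered:
        if time >= ref_time or quality <= best_quality:
            continue  # outside the reference box or already covered
        # New horizontal strip of quality, spanning from `time` to `ref_time`.
        area += (ref_time - time) * (quality - best_quality)
        best_quality = quality
    return area


# Made-up example values, for illustration only.
hv = hypervolume_2d([(1.0, 0.7), (2.0, 0.9)], reference=(4.0, 0.0))
```

With these values the area is (4 − 1) × 0.7 + (4 − 2) × (0.9 − 0.7) = 2.5. Because the result depends on the chosen reference point, hypervolumes are comparable only when computed against the same reference.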
