GreenStableYolo: Optimizing Inference Time and Image Quality of Text-to-Image Generation

Jingzhi Gong; Sisi Li; Giordano d'Aloisio; Zishuo Ding; Yulong Ye; William B. Langdon; Federica Sarro

TL;DR

GreenStableYolo tackles the trade-off between inference time and image quality in text-to-image generation by casting the tuning of prompts and parameters as a multi-objective optimization problem solved with NSGA-II. It measures inference time as GPU time and image quality via a YOLO-based object-matching metric, and it searches for the Pareto front over parameters such as inference steps, guidance scale, and prompts. The approach reports substantial latency reductions and improved hypervolume relative to StableYolo, with a modest drop in image quality, and analyzes parameter importance to inform practical tuning. The work demonstrates the practical potential of NSGA-II-driven optimization for efficiency-aware GenAI deployment and suggests broader applicability to other diffusion-based systems and energy-focused metrics.
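To make the Pareto-front idea concrete, here is a minimal sketch of non-dominated filtering over two objectives, assuming each candidate configuration has already been scored as an (inference time, image quality) pair. The tuple layout, the function names, and the sample values are hypothetical illustrations, not taken from the paper, and NSGA-II itself adds ranking, crowding distance, and genetic operators on top of this basic dominance test.

```python
from typing import List, Tuple

# Hypothetical layout: (inference_time_seconds, image_quality_score).
# Time is minimized, quality is maximized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    time_a, quality_a = a
    time_b, quality_b = b
    no_worse = time_a <= time_b and quality_a >= quality_b
    strictly_better = time_a < time_b or quality_a > quality_b
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up scores for five candidate configurations.
candidates = [(2.0, 0.80), (1.0, 0.60), (3.0, 0.70), (1.5, 0.60), (4.0, 0.95)]
front = pareto_front(candidates)
```

In this made-up example, (3.0, 0.70) is dropped because (2.0, 0.80) is both faster and better, and (1.5, 0.60) is dropped because (1.0, 0.60) is faster at equal quality.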

Abstract

Tuning the parameters and prompts for improving AI-based text-to-image generation has remained a substantial yet unaddressed challenge. Hence we introduce GreenStableYolo, which improves the parameters and prompts for Stable Diffusion to both reduce GPU inference time and increase image generation quality using NSGA-II and Yolo. Our experiments show that despite a relatively slight trade-off (18%) in image quality compared to StableYolo (which only considers image quality), GreenStableYolo achieves a substantial reduction in inference time (266% less) and a 526% higher hypervolume, thereby advancing the state-of-the-art for text-to-image generation.
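Hypervolume summarizes a set of trade-off solutions as the size of the objective-space region they dominate relative to a reference point. The sketch below computes it for two objectives, assuming time is minimized and quality is maximized. The reference point, the absence of normalization, and the sample values are assumptions made for illustration; the abstract does not state how the authors configured their hypervolume calculation.

```python
from typing import List, Tuple

# Hypothetical layout: (inference_time_seconds, image_quality_score).
Point = Tuple[float, float]


def hypervolume_2d(points: List[Point], ref_time: float, ref_quality: float) -> float:
    """Area dominated by the points and bounded by the reference point.

    Time is minimized and quality is maximized, so the reference point should
    be a slow, low-quality corner (large ref_time, small ref_quality).
    """
    # Ignore points that do not improve on the reference in both objectives.
    usable = sorted(p for p in points if p[0] < ref_time and p[1] > ref_quality)
    area = 0.0
    best_quality = ref_quality
    for i, (time, quality) in enumerate(usable):
        # Best quality reachable at or below this time budget so far.
        best_quality = max(best_quality, quality)
        next_time = usable[i + 1][0] if i + 1 < len(usable) else ref_time
        # Add the vertical strip between this time and the next one.
        area += (next_time - time) * (best_quality - ref_quality)
    return area


# Made-up front and reference point (5 s, quality 0).
example_front = [(1.0, 0.60), (2.0, 0.80), (4.0, 0.95)]
hv = hypervolume_2d(example_front, ref_time=5.0, ref_quality=0.0)
```

For the made-up front, the strips contribute 1 × 0.60, 2 × 0.80, and 1 × 0.95, for a total of about 3.15. Dominated points need no pre-filtering because the running maximum over quality absorbs them.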

Paper Structure (5 sections, 2 figures)

Introduction
Related Work
Methodology
Evaluation
Conclusion

Figures (2)

Figure 1: Comparison of GreenStableYolo and StableYolo on 15 independent runs
Figure 2: Parameters/prompts importance based on the mean decrease in impurity
