Exploring the Capabilities of Large Language Models for Generating Diverse Design Solutions
Kevin Ma, Daniele Grandi, Christopher McComb, Kosa Goucher-Lambert
TL;DR
This work analyzes how large language models (GPT-4) generate diverse design solutions for early-stage ideation and how prompt engineering and parameter settings influence that diversity. The authors generate thousands of LLM-derived designs and compare them to 100 crowdsourced solutions in each of five topics using multiple embedding-based diversity metrics. Humans consistently provide more diverse solution sets, though certain LLM configurations enhance diversity, notably temperature = 1 with top-$P$ = 1, as well as critique-based prompting. A logistic regression analysis suggests topic-dependent semantic differences between human- and LLM-generated solutions, indicating that LLMs and crowdsourcing may offer complementary strengths. The study provides practical guidance for calibrating LLMs to support ideation and highlights potential hybrid workflows that combine AI-generated inspiration with human diversity to maximize design exploration.
Abstract
Access to large amounts of diverse design solutions can support designers during the early stage of the design process. In this paper, we explore the efficacy of large language models (LLMs) in producing diverse design solutions, investigating how much parameter tuning and various prompt engineering techniques affect the diversity of LLM-generated design solutions. Specifically, LLMs are used to generate a total of 4,000 design solutions across five distinct design topics, eight combinations of parameters, and eight different types of prompt engineering techniques, and each combination of parameter setting and prompt engineering method is compared across four different diversity metrics. LLM-generated solutions are compared against 100 human-crowdsourced solutions in each design topic using the same set of diversity metrics. Results indicate that human-generated solutions consistently have greater diversity scores across all design topics. Using a post hoc logistic regression analysis, we investigate whether these differences primarily exist at the semantic level. Results show a divide between human- and LLM-generated solutions in some design topics, while other topics show no clear divide. Taken together, these results contribute to the understanding of LLMs' capabilities in generating a large volume of diverse design solutions and offer insights for future research that leverages LLMs to generate diverse design solutions for a broad range of design tasks (e.g., inspirational stimuli).
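As a rough illustration of what an embedding-based diversity score can look like, the sketch below computes the mean pairwise cosine distance over a set of solution embeddings. This is an assumption made for illustration: the abstract does not name the four metrics used, so this measure may or may not be among them. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings.

    Higher values mean the set is more spread out in embedding space.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy vectors standing in for embeddings of design solutions (hypothetical data).
tight_set = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # near-identical ideas
spread_set = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # dissimilar ideas

tight_score = mean_pairwise_distance(tight_set)
spread_score = mean_pairwise_distance(spread_set)
```

Under this measure, a set of near-duplicate solutions scores close to zero while a set of dissimilar solutions scores higher, which is the kind of comparison the study draws between human- and LLM-generated sets.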
