Exploring the Capabilities of Large Language Models for Generating Diverse Design Solutions

Kevin Ma, Daniele Grandi, Christopher McComb, Kosa Goucher-Lambert

TL;DR

This work analyzes how a large language model (GPT-4) generates diverse design solutions for early-stage ideation and how prompt engineering and parameter settings influence this diversity. By generating thousands of LLM-derived designs and comparing them to 100 crowdsourced solutions per topic across five topics using multiple embedding-based diversity metrics, the authors show that humans consistently provide more diverse sets, though certain LLM configurations, notably temperature = 1 with top-$P$ = 1 and critique-based prompting, enhance diversity. A logistic regression analysis suggests topic-dependent semantic differences between human- and LLM-generated solutions, indicating that LLMs and crowdsourcing may offer complementary strengths. The study provides practical guidance for calibrating LLMs to support ideation and highlights potential hybrid workflows that combine AI-generated inspiration with human diversity to maximize design exploration.

Abstract

Access to large amounts of diverse design solutions can support designers during the early stage of the design process. In this paper, we explore the efficacy of large language models (LLMs) in producing diverse design solutions, investigating the impact that parameter tuning and various prompt engineering techniques can have on the diversity of LLM-generated design solutions. Specifically, LLMs are used to generate a total of 4,000 design solutions across five distinct design topics, eight combinations of parameters, and eight different types of prompt engineering techniques, and each combination of parameter and prompt engineering method is compared across four different diversity metrics. LLM-generated solutions are compared against 100 human-crowdsourced solutions in each design topic using the same set of diversity metrics. Results indicate that human-generated solutions consistently have greater diversity scores across all design topics. Using a post hoc logistic regression analysis, we investigate whether these differences primarily exist at the semantic level. Results show that there is a divide in some design topics between human- and LLM-generated solutions, while others have no clear divide. Taken together, these results contribute to the understanding of LLMs' capabilities in generating a large volume of diverse design solutions and offer insights for future research that leverages LLMs to generate diverse design solutions for a broad range of design tasks (e.g., inspirational stimuli).

Paper Structure (28 sections, 1 equation, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Our overall objective is to better understand the ability of LLMs to generate diverse design solutions, tested across a range of design problems and LLM input parameters. For each design topic, we generated 800 design solutions using an LLM (GPT-4) across different generative parameters (temperature/top-P) and different prompt engineering techniques. Likewise, for each design topic, we collected 100 design solutions by crowdsourcing from Amazon Mechanical Turk workers. All solutions were then converted into vector embeddings, which were used to measure diversity for quantitative comparison. This process was repeated for each of 5 design problems, yielding a total of 4,000 LLM-generated design solutions and 500 design solutions collected via Amazon Mechanical Turk.
  • Figure 2: Methodology for zero-shot baseline prompting. To generate a total of 50 design solutions, the initial input is "Generate 5 design solutions for [design problem]" (see Table \ref{tab:design_problem} for the list of design problems and Table \ref{tab:zero-shot-prompting} for an example of how the prompts were input). After the LLM (GPT-4 in our case) generates 5 design solutions, they are stored in a data structure. Using the stored solutions, we conditioned the generation of the next 5 design solutions on those already generated, as shown in the figure. We repeated this loop 9 times, until a total of 50 design solutions had been generated.
  • Figure 3: Four heatmaps, each representing one method of computing diversity. The x-axis shows the temperature and top-P values (see Table \ref{tabular:temp-TopP}), and the y-axis shows the corresponding design topics. Each cell value is the percent difference in diversity relative to 'Human 50 v2', measured for the 50 design solutions (see Section \ref{results:parameter-impact} for an explanation).
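
The batching loop described in the Figure 2 caption and the percent-difference value described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt`, `generate_solutions`, `percent_difference`, and the stand-in `fake_llm` are hypothetical names, the wording used to condition on earlier solutions is invented for illustration, and the percent-difference formula (relative to the human score) is one plausible reading of the caption.

```python
from typing import Callable, List


def build_prompt(design_problem: str, previous: List[str], batch_size: int = 5) -> str:
    """Compose one request; condition on earlier solutions if any exist.

    The opening sentence follows the Figure 2 caption. The conditioning
    wording below is an assumption made for this sketch.
    """
    prompt = f"Generate {batch_size} design solutions for {design_problem}"
    if previous:
        listed = "\n".join(f"- {s}" for s in previous)
        prompt += f"\nPreviously generated solutions:\n{listed}"
    return prompt


def generate_solutions(
    design_problem: str,
    generate_batch: Callable[[str], List[str]],
    total: int = 50,
    batch_size: int = 5,
) -> List[str]:
    """Collect `total` solutions in batches of `batch_size`.

    With the defaults this makes 10 calls: one initial request plus
    9 follow-up requests conditioned on the stored solutions.
    """
    solutions: List[str] = []
    for _ in range(total // batch_size):
        prompt = build_prompt(design_problem, solutions, batch_size)
        batch = generate_batch(prompt)
        solutions.extend(batch[:batch_size])  # store for the next round
    return solutions


def percent_difference(llm_score: float, human_score: float) -> float:
    """Percent difference of an LLM diversity score relative to a human
    baseline score (assumed formula; negative means less diverse)."""
    return 100.0 * (llm_score - human_score) / human_score


def fake_llm(prompt: str) -> List[str]:
    """Hypothetical stand-in for a model call; returns 5 placeholder strings."""
    start = prompt.count("\n- ")  # number of solutions already listed
    return [f"solution {start + i + 1}" for i in range(5)]


if __name__ == "__main__":
    ideas = generate_solutions("a lightweight exercise device", fake_llm)
    print(len(ideas), ideas[:3])
```

In practice `generate_batch` would wrap a real model call and parse its reply into a list of strings; passing it in as a callable keeps the loop independent of any particular API.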