Evaluating Large Language Models for Material Selection

Daniele Grandi; Yash Patawari Jain; Allin Groom; Brandon Cramer; Christopher McComb

TL;DR

This work addresses how large language models can assist material selection in conceptual design by comparing model-generated assessments to a data-backed expert corpus. It collects 10,544 expert scores across 16 design scenarios and 9 materials, then evaluates GPT-4, Mixtral, and MechGPT under zero-shot, few-shot, parallel prompting, chain-of-thought, and temperature variations, using $z = \frac{x - \mu}{\sigma}$ and $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$ as key metrics. The results show LLMs exhibit bias toward certain materials and lower variance than experts; parallel prompting generally improves performance and scalability, while chain-of-thought prompting often reduces accuracy. The findings highlight the potential and limitations of LLMs for material selection, emphasizing the need for task-specific prompting and careful integration into design workflows to avoid overestimation and loss of expert diversity.
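As a rough illustration of the two metrics named above, the following sketch computes z-scores and the mean absolute error for a handful of scores. The score values, the function names, and the use of the population standard deviation are assumptions made for this example; they are not taken from the paper's data or code.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values as z = (x - mu) / sigma.

    Assumption: sigma is the population standard deviation; the source
    does not say whether the population or sample form is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: every z-score is defined here as 0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def mean_absolute_error(expert, predicted):
    """MAE = (1/n) * sum(|y_i - y_hat_i|) over paired scores."""
    if len(expert) != len(predicted):
        raise ValueError("expert and predicted must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(expert, predicted)) / len(expert)


# Hypothetical scores for four materials in one design scenario.
expert_scores = [8.0, 6.0, 4.0, 2.0]
llm_scores = [7.0, 7.0, 5.0, 4.0]

mae = mean_absolute_error(expert_scores, llm_scores)
z = z_scores(expert_scores)
print(f"MAE = {mae:.2f}")
print("z-scores:", [round(v, 3) for v in z])
```

A lower MAE means the model's scores sit closer to the expert scores on average, while z-scores put responses on a common scale so that deviations from the mean can be compared across groups.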

Abstract

Material selection is a crucial step in conceptual design due to its significant impact on the functionality, aesthetics, manufacturability, and sustainability of the final product. This study investigates the use of Large Language Models (LLMs) for material selection in the product design process and compares the performance of LLMs against expert choices for various design scenarios. By collecting a dataset of expert material preferences, the study provides a basis for evaluating how well LLMs can align with expert recommendations through prompt engineering and hyperparameter tuning. The divergence between LLM and expert recommendations is measured across different model configurations, prompt strategies, and temperature settings. This approach allows for a detailed analysis of factors influencing the LLMs' effectiveness in recommending materials. The results from this study highlight two failure modes and identify parallel prompting as a useful prompt-engineering method when using LLMs for material selection. The findings further suggest that, while LLMs can provide valuable assistance, their recommendations often differ significantly from those of human experts. This discrepancy underscores the need for further research into how LLMs can be better tailored to replicate expert decision-making in material selection. This work contributes to the growing body of knowledge on how LLMs can be integrated into the design process, offering insights into their current limitations and potential for future improvements.

Paper Structure (25 sections, 2 equations, 7 figures, 3 tables)

This paper contains 25 sections, 2 equations, 7 figures, 3 tables.

Introduction
Background
Challenges in Material Selection
Evaluating Large Language Models
Automating Material Selection with Machine Learning and Large Language Models
Methods
Data Collection
LLMs used for evaluation
LLM Experiments
Zero-shot
Few-shot
Parallel Agents
Chain-of-thought
Temperature
Evaluation
...and 10 more sections

Figures (7)

Figure 1: Overview of the method used to create the corpus of questions submitted to survey participants and to the LLMs, the experiments used to evaluate the LLMs, and the evaluation metrics used to compare the LLM results to the survey responses.
Figure 2: Distribution of the 10,544 survey responses from 136 experts grouped by design and criteria.
Figure 3: Aggregate survey and zero-shot LLM responses, showing the full range and quartiles across all designs, criteria, and materials.
Figure 4: Aggregate survey and zero-shot LLM results grouped by material.
Figure 5: Aggregate survey and zero-shot LLM results grouped by design (rows) and criteria (columns).
...and 2 more figures
