FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks

Ahmadreza Eslaminia, Adrian Jackson, Beitong Tian, Avi Stern, Hallie Gordon, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao

TL;DR

FDM-Bench addresses the need for standardized evaluation of large language models (LLMs) in additive manufacturing (AM), focusing on tasks specific to Fused Deposition Modeling (FDM). It defines two tasks, G-code anomaly detection and user-query response, with data spanning a range of user expertise levels and G-code anomalies. The study compares four LLMs (GPT-4o, Claude 3.5 Sonnet, Llama-3.1-70B, Llama-3.1-405B) and finds that closed-source models are generally superior for G-code anomaly detection, while Llama-3.1-405B holds a slight advantage over the other models on user queries. The results underscore the potential of open-source LLMs and suggest that further gains can be achieved via fine-tuning, retrieval augmentation, and advanced prompting, enabling robust LLM support for AM design and manufacturing.

Abstract

Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production processes in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assesses the models' responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench's potential as a foundational tool for advancing research on LLM capabilities in FDM.

Paper Structure

This paper contains 25 sections and 5 figures.

Figures (5)

  • Figure 1: Printed parts illustrating different FDM quality classes: (a) ND, (b) UE, (c) OE, and (d) SP.
  • Figure 2: Confusion matrices for G-code anomaly detection across four LLMs: (a) GPT-4o, (b) Claude 3.5 Sonnet, (c) Llama-3.1-70B, and (d) Llama-3.1-405B.
  • Figure 3: Average probability assigned by each LLM to the correct label for each G-code class. Crosses indicate the average probability assigned to each label across all samples.
  • Figure 4: Average scores (1–5 scale) for each LLM on free-form user query responses across three metrics: accuracy, precision, and relevance.
  • Figure 5: Accuracy comparison of LLMs in answering multiple-choice questions across different user expertise levels.
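
For readers unfamiliar with the kind of confusion matrix reported in Figure 2, the following minimal Python sketch shows how one could be tabulated from true and predicted class labels. It is an illustration only, not the authors' evaluation code: the class abbreviations are taken from the Figure 1 caption, while the function names and the toy label lists are hypothetical and do not reflect any result from the paper.

```python
from collections import Counter

# Class abbreviations as they appear in the Figure 1 caption.
LABELS = ["ND", "UE", "OE", "SP"]


def confusion_matrix(true_labels, predicted_labels, labels=LABELS):
    """Return matrix[i][j]: count of samples whose true class is labels[i]
    and whose predicted class is labels[j].

    Pairs containing a label not listed in `labels` are silently ignored.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true and predicted label lists must have equal length")
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (diagonal / row sum)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical toy data, made up purely for illustration.
y_true = ["ND", "ND", "UE", "OE", "SP", "SP"]
y_pred = ["ND", "UE", "UE", "OE", "SP", "ND"]

matrix = confusion_matrix(y_true, y_pred)
recall = per_class_recall(matrix)

for label, row, r in zip(LABELS, matrix, recall):
    print(f"{label}: {row}  recall={r:.2f}")
```

Each row corresponds to a true class and each column to a predicted class, so off-diagonal entries show which classes a model confuses with one another.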