Do Multimodal Large Language Models Understand Welding?

Grigorii Khvatskii, Yong Suk Lee, Corey Angst, Maria Gibbs, Robert Landers, Nitesh V. Chawla

TL;DR

This study assesses multimodal LLMs for welding quality assessment by constructing Real-World and Online weld image datasets annotated by an expert and evaluating GPT-4o and LLaVA-1.6 under zero-shot and WeldPrompt prompting across RV & Marine, Aeronautical, and Farming contexts. It demonstrates that online images yield higher performance than real-world images and that general-purpose MLLMs face generalization limits in high-stakes, domain-specific tasks, though domain-informed prompting (WeldPrompt) can improve alignment with expert judgments in some cases. The results highlight the need for domain-specific fine-tuning, robust perception-reasoning pipelines, and retrieval-augmented or XAI-enabled strategies to improve reliability in manufacturing. Overall, the work informs future directions for applying MLLMs in Industry 4.0/5.0, balancing cost, accuracy, and interpretability in real-world welding workflows.

Abstract

This paper examines the performance of Multimodal LLMs (MLLMs) in skilled production work, with a focus on welding. Using a novel dataset of real-world and online weld images, annotated by a domain expert, we evaluate the performance of two state-of-the-art MLLMs in assessing weld acceptability across three contexts: RV & Marine, Aeronautical, and Farming. While both models perform better on online images, likely due to prior exposure or memorization, they also perform relatively well on unseen, real-world weld images. Additionally, we introduce WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to mitigate hallucinations and improve reasoning. WeldPrompt improves model recall in certain contexts but exhibits inconsistent performance across others. These results underscore the limitations and potential of MLLMs in high-stakes technical domains and highlight the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies to improve model reliability. The study opens avenues for further research into multimodal learning in industry applications.

Paper Structure

This paper contains 13 sections, 5 tables, 3 algorithms.