A Multimodal Manufacturing Safety Chatbot: Knowledge Base Design, Benchmark Development, and Evaluation of Multiple RAG Approaches

Ryan Singh, Austin Hamilton, Amanda White, Michael Wise, Ibrahim Yousif, Arthur Carvalho, Zhe Shan, Reza Abrisham Baf, Mohammad Mayyas, Lora A. Cavuoto, Fadel M. Megahed

TL;DR

The paper addresses safety training in Industry 5.0 manufacturing by developing an open-source, multimodal chatbot that uses Retrieval-Augmented Generation (RAG) to ground its answers in regulatory and OEM documents. It follows a Design Science Research framework with a six-phase artifact lifecycle that covers six retrieval strategies, a publicly demonstrable interface, and a benchmark spanning three machines. Through a full-factorial automated evaluation of 24 pipelines and a blinded human assessment, the study identifies a best-performing configuration (OpenAI Keyword retrieval with gpt-5-mini and top-k=7) that achieves 86.66% accuracy, ~10 s latency, and ~$0.005 per query. The work contributes a domain-grounded safety chatbot, a validated evaluation benchmark, and a systematic methodology for designing and assessing AI-enabled instructional systems for Industry 5.0. Overall, the results show that retrieval strategy and model configuration determine how well a system balances accuracy, speed, and cost for real-time, safety-critical guidance in manufacturing settings.

Abstract

Ensuring worker safety remains a critical challenge in modern manufacturing environments. Industry 5.0 reorients the prevailing manufacturing paradigm toward more human-centric operations. Using a design science research methodology, we identify three essential requirements for next-generation safety training systems: high accuracy, low latency, and low cost. We introduce a multimodal chatbot powered by large language models that meets these design requirements. The chatbot uses retrieval-augmented generation to ground its responses in curated regulatory and technical documentation. To evaluate our solution, we developed a domain-specific benchmark of expert-validated question-answer pairs for three representative machines: a Bridgeport manual mill, a Haas TL-1 CNC lathe, and a Universal Robots UR5e collaborative robot. We tested 24 RAG configurations using a full-factorial design and assessed them with automated evaluations of correctness, latency, and cost. Our top two configurations were then evaluated by ten industry experts and academic researchers. Our results show that retrieval strategy and model configuration have a significant impact on performance. The top configuration (selected for chatbot deployment) achieved an accuracy of 86.66%, an average latency of 10.04 seconds, and an average cost of $0.005 per query. Overall, our work provides three contributions: an open-source, domain-grounded safety training chatbot; a validated benchmark for evaluating AI-assisted safety instruction; and a systematic methodology for designing and assessing AI-enabled instructional and immersive safety training systems for Industry 5.0 environments.
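
A full-factorial design crosses every level of each factor with every level of the others, so the pipeline list is a Cartesian product. The sketch below is illustrative only: the text above gives six retrieval strategies and 24 pipelines but not the other factor levels, so it assumes two models and two top-k values (one factorization that yields 24). Only `gpt-5-mini`, `OpenAI Keyword`, and top-k = 7 come from the text; every other level, and the exact label formatting, is a hypothetical placeholder.

```python
from itertools import product

# Factor levels. Only "gpt-5-mini", "OpenAI Keyword", and top-k = 7 appear in
# the text; all other entries are hypothetical placeholders.
MODELS = ["gpt-5-mini", "placeholder-model"]
RETRIEVERS = [
    "OpenAI Keyword",
    "placeholder-retriever-2",
    "placeholder-retriever-3",
    "placeholder-retriever-4",
    "placeholder-retriever-5",
    "placeholder-retriever-6",
]
TOP_K_VALUES = [3, 7]  # 3 is a placeholder; 7 is the reported best setting


def full_factorial(models, retrievers, top_k_values):
    """Return every (model, retriever, top_k) combination exactly once."""
    return list(product(models, retrievers, top_k_values))


def pipeline_label(model, retriever, top_k):
    """Build a Model_Retriever_TopK label (formatting is an assumption)."""
    return f"{model}_{retriever.replace(' ', '')}_{top_k}"


pipelines = full_factorial(MODELS, RETRIEVERS, TOP_K_VALUES)
labels = [pipeline_label(*p) for p in pipelines]
print(len(pipelines), "pipelines")
```

Because each combination appears exactly once, effects of retrieval strategy, model, and top-k can be compared without confounding one factor with another.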

Paper Structure

This paper contains 25 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: An overview of our proposed six-phase framework for developing, demonstrating, and evaluating a multimodal manufacturing-safety chatbot.
  • Figure 2: Regulatory, safety, and OEM documents forming our safety corpus.
  • Figure 3: A screenshot of the voice module of our deployed chatbot. The chatbot is available at https://sight.fsb.miamioh.edu/.
  • Figure 4: A screenshot of the interface used for human evaluation.
  • Figure 5: Comparison of the performance of our 24 RAG pipelines across correctness (A), generation time, i.e., latency (B), and cost (C). The y-axis in each subplot utilizes the LLMModel_RagApproach_TopKChunksUsedAsInput naming convention. For each subfigure, we sort the pipelines by performance, placing the best at the top (using median performance for the box plots).
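
The ordering described in the Figure 5 caption (best pipeline first, ranked by median for the box plots) can be sketched as below. This is a minimal illustration, not the authors' plotting code: the labels are hypothetical and the per-query values are synthetic placeholders, not results from the paper.

```python
from statistics import median


def rank_by_median(samples, higher_is_better):
    """Return pipeline labels sorted best-first by the median of their values."""
    return sorted(
        samples,
        key=lambda name: median(samples[name]),
        reverse=higher_is_better,
    )


# Synthetic per-query latencies in seconds (placeholders, not paper data).
latency_seconds = {
    "modelA_retrieverX_3": [4.0, 5.0, 6.0],
    "modelA_retrieverY_7": [9.0, 10.0, 30.0],
    "modelB_retrieverX_7": [7.0, 8.0, 9.0],
}

# Lower latency is better, so the smallest median comes first.
ranked = rank_by_median(latency_seconds, higher_is_better=False)
print(ranked)
```

For latency and cost a lower median ranks higher, while for correctness the flag would be `higher_is_better=True`. Using the median keeps a single slow query, such as the 30.0 s placeholder, from dominating the ordering.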