A Multimodal Manufacturing Safety Chatbot: Knowledge Base Design, Benchmark Development, and Evaluation of Multiple RAG Approaches
Ryan Singh, Austin Hamilton, Amanda White, Michael Wise, Ibrahim Yousif, Arthur Carvalho, Zhe Shan, Reza Abrisham Baf, Mohammad Mayyas, Lora A. Cavuoto, Fadel M. Megahed
TL;DR
The paper tackles safety training in Industry 5.0 manufacturing by developing an open-source, multimodal chatbot grounded in regulatory and OEM documents using Retrieval-Augmented Generation (RAG). It follows a Design Science Research framework with a six-phase artifact lifecycle and delivers six retrieval strategies, a publicly demonstrable interface, and a rigorous benchmark across three machines. Through a full-factorial automated evaluation of 24 pipelines and a blinded human assessment, the study identifies a best-performing configuration (OpenAI Keyword retrieval with gpt-5-mini and top-k=7) that achieves 86.66% accuracy, ~10 s latency, and ~$0.005 per query. The work contributes a domain-grounded safety chatbot, a validated evaluation benchmark, and a systematic methodology for designing and assessing AI-enabled instructional systems for Industry 5.0. Overall, the results illustrate how strongly retrieval strategy and model configuration affect the balance of accuracy, speed, and cost for real-time, safety-critical guidance in modern manufacturing settings.
Abstract
Ensuring worker safety remains a critical challenge in modern manufacturing environments. Industry 5.0 reorients the prevailing manufacturing paradigm toward more human-centric operations. Using a design science research methodology, we identify three essential requirements for next-generation safety training systems: high accuracy, low latency, and low cost. We introduce a multimodal chatbot powered by large language models that meets these design requirements. The chatbot uses retrieval-augmented generation to ground its responses in curated regulatory and technical documentation. To evaluate our solution, we developed a domain-specific benchmark of expert-validated question-answer pairs for three representative machines: a Bridgeport manual mill, a Haas TL-1 CNC lathe, and a Universal Robots UR5e collaborative robot. We tested 24 RAG configurations using a full-factorial design and assessed them with automated evaluations of correctness, latency, and cost. Our top two configurations were then evaluated by ten industry experts and academic researchers. Our results show that retrieval strategy and model configuration have a significant impact on performance. The top configuration (selected for chatbot deployment) achieved an accuracy of 86.66%, an average latency of 10.04 seconds, and an average cost of $0.005 per query. Overall, our work provides three contributions: an open-source, domain-grounded safety training chatbot; a validated benchmark for evaluating AI-assisted safety instruction; and a systematic methodology for designing and assessing AI-enabled instructional and immersive safety training systems for Industry 5.0 environments.
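The full-factorial design mentioned above can be sketched as the Cartesian product of the factor levels. The sketch below is illustrative only. The text gives the total of 24 configurations, the count of six retrieval strategies, and the names "OpenAI Keyword", "gpt-5-mini", and top-k = 7. The split into two models and two top-k values is an assumption made so the product equals 24, and every other level name is a hypothetical placeholder.

```python
from itertools import product

# Factor levels. Only "openai_keyword", "gpt-5-mini", and top-k = 7 come from
# the text; the remaining names and the 6 x 2 x 2 split are assumptions.
RETRIEVAL_STRATEGIES = ["openai_keyword"] + [f"strategy_{i}" for i in range(2, 7)]
MODELS = ["gpt-5-mini", "model_b"]  # "model_b" is a placeholder
TOP_K = [3, 7]                      # 3 is a placeholder level


def full_factorial(strategies, models, top_ks):
    """Return every combination of the factor levels as a list of dicts."""
    return [
        {"retrieval": r, "model": m, "top_k": k}
        for r, m, k in product(strategies, models, top_ks)
    ]


pipelines = full_factorial(RETRIEVAL_STRATEGIES, MODELS, TOP_K)
print(len(pipelines))  # number of configurations to evaluate
```

Each resulting dictionary would then be run against the benchmark and scored on correctness, latency, and cost. A full-factorial layout makes it possible to compare the effect of one factor while the others vary over all of their levels.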
