QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing
Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman
TL;DR
QA-TOOLBOX presents a dataset- and benchmark-driven approach to conversational QA for manufacturing task guidance, grounded in specs and observed narrations. It combines Assembly101 with an internal corpus collected through Wizard-of-Oz (WoZ) sessions, and uses LLM-driven data augmentation to fill in missing information and seed questions. Evaluation relies on LLM-as-a-judge and expert validation. The work introduces a baseline system using off-the-shelf LLMs under 15B parameters and a prompt design that limits vision to text, enabling reference-free evaluation. The results show that larger-capacity models perform better and that judge models align with expert ratings, highlighting practical pathways for deploying manufacturing task assistants and planning future multimodal and retrieval-enhanced extensions.
Abstract
In this work we explore the use of LLMs for data augmentation in a manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work is to explore the task, data augmentation for the supported tasks, and the performance of existing LLMs. We observe that the task is complex, requiring an understanding of procedure specification documents and of actions and objects sequenced temporally. The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. We compared several popular open-source LLMs by developing a baseline with each one and comparing their responses in a reference-free setting using LLM-as-a-judge. We then compared the resulting ratings with those of crowd-workers and validated the ratings with experts.
