QA-TOOLBOX: Conversational Question-Answering for process task guidance in manufacturing

Ramesh Manuvinakurike, Elizabeth Watkins, Celal Savur, Anthony Rhodes, Sovan Biswas, Gesem Gudino Mejia, Richard Beckwith, Saurav Sahay, Giuseppe Raffa, Lama Nachman

TL;DR

QA-TOOLBOX presents a dataset- and benchmark-driven approach to conversational QA for manufacturing task guidance, grounded in specification documents and observed narrations. It combines Assembly101 with an internal WoZ-collected corpus, uses LLM-driven data augmentation to fill in missing information and seed questions, and is evaluated with LLM-as-a-judge and expert validation. The work introduces a baseline system using off-the-shelf LLMs under 15B parameters and a prompt design that limits vision to text, enabling reference-free evaluation. The results show that larger-capacity models perform better and that judge models align with expert ratings, highlighting practical pathways for deploying manufacturing task assistants and for planning future multimodal and retrieval-enhanced extensions.

Abstract

In this work we explore the use of LLMs for data augmentation in a manufacturing task guidance system. The dataset consists of representative samples of interactions with technicians working in an advanced manufacturing setting. The purpose of this work is to explore the task, data augmentation for the supported tasks, and the performance of existing LLMs. We observe that the task is complex, requiring an understanding of procedure specification documents and of actions and objects sequenced temporally. The dataset consists of 200,000+ question/answer pairs that refer to the spec document and are grounded in narrations and/or video demonstrations. We compared the performance of several popular open-source LLMs by developing a baseline with each LLM, comparing the responses in a reference-free setting using LLM-as-a-judge, and comparing the ratings with those of crowd-workers while validating the ratings with experts.

Paper Structure

This paper contains 25 sections, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Example of task guidance QA exchanges.
  • Figure 2: Workflow for creating the QA-TOOLBOX dataset, baseline, and evaluations. An open-source dataset is augmented with the information needed for the manufacturing setting. Baseline systems use LLMs to generate answers. LLM-as-a-judge (Zheng et al., 2024) evaluates the responses, and we perform expert validation to ensure the ratings' viability.
  • Figure 3: Radar charts of LLM-as-a-judge ratings for the six models. (i) The left chart shows the models' performance on 3 categories of questions, measured across Correctness, Conciseness, Completeness, and Groundedness. The model comparison table gives the values plotted in the charts. (ii) The right chart shows the ratings given by different LLM-judge models to responses generated by the Llama-3-8B-Instruct model.
  • Figure 4: The three augmentation approaches and the format of the data.