Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU

Rehana Mahfuz, Yinyi Guo, Erik Visser, Phanidhar Chinchili

TL;DR

The paper addresses the cognitive load of executing procedural manual tasks by introducing a proactive, real-time conversational assistant that operates entirely on-device using privacy-preserving audio and IMU signals from a wearable. It integrates lightweight activity recognition (audio via CNN10 PANN and IMU via Attend&Discriminate) with a proactive language-model dialogue system guided by a rule-based step tracker, enabling step-by-step instructions and question answering without relying on video or cloud processing. A novel User Whim Agnostic (UWA) LoRA finetuning method improves the model's ability to suppress low-value chatter while maintaining its tendency to issue important instructions, yielding a >30% improvement in F-score; finetuning also gives a 16x speedup by eliminating the in-context examples that standard prompting requires. A dataset of 600 conversations and extensive on-device results validate the approach. The contributions include a full on-device implementation (Android wear + edge AI box with Whisper and MeloTTS), a data-generation pipeline for task-specific logs, and comprehensive metrics (SentenceBERT, BERTScore, entailment, and TNR) supported by human evaluation, highlighting improved user experience and privacy-preserving operation. The work sets the stage for extending proactive wearable-based guidance to more tasks and automating task enrollment, and it discusses limitations such as wrist-dominance assumptions and timing decisions.

Abstract

Real-time conversational assistants for procedural tasks often depend on video input, which can be computationally expensive and compromise user privacy. For the first time, we propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities such as audio and IMU inputs from a user's wearable device to understand the context. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions. We construct a dataset containing conversations where the assistant guides the user in performing the task. On observing that an off-the-shelf language model is a very talkative assistant, we design a novel User Whim Agnostic (UWA) LoRA finetuning method which improves the model's ability to suppress less informative dialogues, while maintaining its tendency to communicate important instructions. This leads to >30% improvement in the F-score. Finetuning the model also results in a 16x speedup by eliminating the need to provide in-context examples in the prompt. We further describe how such an assistant is implemented on edge devices with no dependence on the cloud.
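To make the trade-off between talkativeness and instruction coverage concrete, the following is a minimal sketch of how speak/stay-silent decisions could be scored. It assumes each trigger event carries a binary reference label (whether the assistant should speak) and a binary prediction (whether it did); this framing, the `Scores` and `speak_silence_scores` names, and the toy labels are illustrative assumptions, not the paper's exact evaluation protocol.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    precision: float
    recall: float
    f_score: float
    tnr: float  # true negative rate: fraction of "stay silent" turns kept silent


def speak_silence_scores(should_speak: Sequence[bool],
                         did_speak: Sequence[bool]) -> Scores:
    """Score binary speak/stay-silent decisions, one pair per trigger event."""
    if len(should_speak) != len(did_speak):
        raise ValueError("label lists must have equal length")
    pairs = list(zip(should_speak, did_speak))
    tp = sum(1 for ref, hyp in pairs if ref and hyp)
    fp = sum(1 for ref, hyp in pairs if not ref and hyp)
    fn = sum(1 for ref, hyp in pairs if ref and not hyp)
    tn = sum(1 for ref, hyp in pairs if not ref and not hyp)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return Scores(precision, recall, f_score, tnr)


# Toy labels (hypothetical): three of six turns call for an instruction.
reference = [True, True, False, False, False, True]
talkative = [True] * 6                                 # speaks at every turn
restrained = [True, True, False, False, False, False]  # misses one instruction

print(speak_silence_scores(reference, talkative))
print(speak_silence_scores(reference, restrained))
```

In this toy case the always-talking assistant reaches full recall but a true negative rate of zero, while the restrained one gives up some recall for perfect precision and a higher F-score, which is the kind of behaviour the finetuning is described as encouraging.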

Paper Structure (23 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Design of our proactive situated conversational assistant. Occurrences of user comments or recognized activities trigger calls to the language model that provides responses as necessary.
  • Figure 2: Flowchart showing the furniture assembly task.
  • Figure 3: Diagram illustrating on-device implementation of the proactive situated assistant.
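
The event-triggered design described in the Figure 1 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions: the `Event`, `StepTracker`, `run_assistant`, and `stub_respond` names, the step names, and the stubbed response function are all hypothetical stand-ins, and the language model call is replaced by a simple rule so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    kind: str     # "activity" (recognized from audio/IMU) or "comment" (from the user)
    content: str


class StepTracker:
    """Hypothetical rule-based tracker: advance when the expected activity is seen."""

    def __init__(self, steps: List[str]):
        self.steps = steps
        self.index = 0

    def update(self, event: Event) -> bool:
        """Return True if this event completed the currently expected step."""
        if (event.kind == "activity"
                and self.index < len(self.steps)
                and event.content == self.steps[self.index]):
            self.index += 1
            return True
        return False

    @property
    def next_step(self) -> Optional[str]:
        return self.steps[self.index] if self.index < len(self.steps) else None


def stub_respond(event: Event, tracker: StepTracker, advanced: bool) -> Optional[str]:
    """Stand-in for the language model: returning None means staying silent."""
    if event.kind == "comment":
        return f"Answering: {event.content}"
    if advanced:
        nxt = tracker.next_step
        return f"Next: {nxt}" if nxt else "All steps done."
    return None


def run_assistant(events: List[Event], tracker: StepTracker,
                  respond: Callable[[Event, StepTracker, bool], Optional[str]]) -> List[str]:
    """Call the responder on every event and collect only non-silent replies."""
    spoken = []
    for event in events:
        advanced = tracker.update(event)
        reply = respond(event, tracker, advanced)
        if reply is not None:
            spoken.append(reply)
    return spoken


tracker = StepTracker(["attach legs", "tighten screws"])  # hypothetical step names
events = [
    Event("activity", "attach legs"),
    Event("activity", "attach legs"),      # repeated activity: nothing new to say
    Event("comment", "Which screw?"),
    Event("activity", "tighten screws"),
]
spoken = run_assistant(events, tracker, stub_respond)
print(spoken)
```

Every event triggers a call to the responder, but only some calls produce speech, mirroring the caption's "responses as necessary".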