GRILLBot In Practice: Lessons and Tradeoffs Deploying Large Language Models for Adaptable Conversational Task Assistants

Sophie Fischer; Carlos Gemmell; Niklas Tecklenburg; Iain Mackie; Federico Rossetto; Jeffrey Dalton

TL;DR

GRILLBot demonstrates a practical, hybrid approach to deploying large language models for adaptable conversational task assistants in a real-world setting. By combining a Neural Decision Parser for low-latency system actions with LLM-driven knowledge-grounded QA and live task adaptation, the system achieves responsive interactions while maintaining grounding and safety. The work provides extensive component-level evaluations that reveal clear latency–accuracy tradeoffs and show where smaller, specialized models excel versus where larger LLMs add value, and it introduces a new task-oriented QA dataset, WoTe. Reproducibility is emphasized through the OAT framework and open data release, offering a roadmap for deploying future task-oriented assistants at scale.

Abstract

We tackle the challenge of building multimodal assistants for complex real-world tasks. We describe the practicalities and challenges of developing and deploying GRILLBot, a leading system in the Alexa Prize TaskBot Challenge (first prize in 2022, second prize in 2023). Building on our Open Assistant Toolkit (OAT) framework, we propose a hybrid architecture that leverages Large Language Models (LLMs) and specialised models tuned for specific subtasks requiring very low latency. OAT allows us to define when, how and which LLMs should be used in a structured and deployable manner. For knowledge-grounded question answering and live task adaptations, we show that LLM reasoning abilities over task context and world knowledge outweigh latency concerns. For dialogue state management, we implement a code generation approach and show that specialised smaller models achieve 84% effectiveness with 100x lower latency. Overall, we provide insights and discuss tradeoffs for deploying both traditional models and LLMs to users in complex real-world multimodal environments in the Alexa Prize TaskBot Challenge. These experiences will continue to evolve as LLMs become more capable and efficient, fundamentally reshaping OAT and future assistant architectures.

Paper Structure (30 sections, 11 figures, 9 tables)

Introduction
Related Work
End-to-end dialogue models
Modular Agent Architectures
State management
Task-specific question answering
Dynamic Task Adaptation
Implementation details
TaskBot Task
Online GRILLBot System Architecture
Code generation for dialogue management
Task-specific retrieval-augmented question answering
Live generative task adaption
Lessons Learned and Shortcomings
Evaluation
...and 15 more sections

Figures (11)

Figure 1: A multimodal conversation with OAT including task adaptation and question answering with system actions by the NDP in green.
Figure 2: Online architecture of GRILLBot based on OAT. We implement the NDP and QA in Neural functionalities and task adaptation in LLM functionalities.
Figure 3: Live task adaptation based on the Replacement Generator and Task Rewriter.
Figure 4: Generated action distribution from conversations where a task is started compared to exploratory-only.
Figure 5: Prompt fed into the Alpaca model when the NDP generates a system action that has no implemented back-end logic, i.e. when no system action should be performed live.
...and 6 more figures
