Multimodal Reinforcement Learning for Robots Collaborating with Humans

Afagh Mehri Shervedani; Siyu Li; Natawut Monaikul; Bahareh Abbasi; Barbara Di Eugenio; Milos Zefran

TL;DR

This work addresses the challenge of building scalable, multimodal interaction managers for assistive robots that collaborate with humans. It introduces an RL-based policy trained against a neural user simulator that models both language and physical actions. The policy is bootstrapped with a DAgger warm-up and then trained with Deep-Q-Learning, using a reward structure built from $-r$, $-2r$, and $+2r$ terms. The authors implement the framework on a Baxter robot, integrating a perception module with an ALBERT-based dialogue-act classifier and a rule-based speech generator. A human study (12 participants, 75 trials) shows a low non-understanding rate (~9.8%), high task success (~96%), and strong user satisfaction. Compared with an HBATN baseline, the RL system achieves higher real-time accuracy (~97.5%), lower SSREs (6.4%), and better user-perceived quality, which suggests that the approach offers a scalable, robust path to multimodal human-robot collaboration. The work contributes an end-to-end pipeline, from data-driven user simulation and RL training to robot-level perception, execution, and human evaluation, that may generalize to additional tasks and environments.
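To make the reward terms and the Q-learning step concrete, here is a minimal sketch of a one-step Q-learning target. It is not the authors' implementation: the summary only says that $-r$, $-2r$, and $+2r$ terms exist, so the mapping of outcomes to rewards (`REWARDS`), the base magnitude `R`, the discount `gamma`, and the learning rate `lr` are all hypothetical choices made for illustration.

```python
import numpy as np

R = 1.0  # hypothetical base reward magnitude r

# Hypothetical assignment of outcomes to the three reward terms. The source
# does not say which event earns which term; this mapping is an assumption.
REWARDS = {"step": -R, "failure": -2 * R, "success": 2 * R}


def td_target(reward, next_q_values, done, gamma=0.9):
    """Standard one-step Q-learning target.

    Returns r for a terminal transition, otherwise r + gamma * max_a' Q(s', a').
    """
    if done:
        return float(reward)
    return float(reward + gamma * np.max(next_q_values))


def q_update(q_value, target, lr=0.5):
    """Move the current Q estimate a fraction lr toward the target."""
    return q_value + lr * (target - q_value)


# Example: a non-terminal step followed by a successful terminal step.
step_target = td_target(REWARDS["step"], np.array([0.5, 1.0]), done=False)
final_target = td_target(REWARDS["success"], np.array([0.0, 0.0]), done=True)
```

In a deep Q-network the scalar `q_update` is replaced by a gradient step on the squared difference between the network's prediction and `td_target`, but the target itself has the same form.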

Abstract

Robot assistants for older adults and people with disabilities need to interact with their users in collaborative tasks. The core component of these systems is an interaction manager whose job is to observe and assess the task, and to infer the state of the human and their intent, in order to choose the best course of action for the robot. Due to the sparseness of the data in this domain, the policy for such multimodal systems is often crafted by hand; as the complexity of interactions grows, this process does not scale. In this paper, we propose a reinforcement learning (RL) approach to learn the robot policy. In contrast to dialog systems, our agent is trained with a simulator developed from human data and can deal with multiple modalities such as language and physical actions. We conducted a human study to evaluate the performance of the system in interaction with a user. The system shows promising preliminary results when used by a real user.

Paper Structure (21 sections, 4 figures, 4 tables)

This paper contains 21 sections, 4 figures, 4 tables.

Introduction
Related Work
User Simulator
Feature Extraction
Data Annotation
Data Augmentation
Model Architecture and Training
Reinforcement Learning Framework
Model Architecture
DAgger Warm-up
Deep-Q-Learning
Experimental Evaluations
Robot Implementation
Perception Module
Execution Module
...and 6 more sections

Figures (4)

Figure 1: The Sense-Plan-Act cycle in an assistive robot
Figure 2: The interaction between the User Simulator and the HEL agent during RL
Figure 3: DAgger Algorithm Training Evaluations
Figure 4: DQL Algorithm Training Evaluations
