Learning Multimodal Confidence for Intention Recognition in Human-Robot Interaction

Xiyuan Zhao; Huijun Li; Tianyuan Miao; Xianyi Zhu; Zhikai Wei; Aiguo Song

TL;DR

The paper addresses uncertainty in multimodal intention recognition for human-robot interaction (HRI) by introducing Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP), a constrained, learning-based fusion framework. It combines Bayesian opinion-pool fusion with batch learning to adapt the modality confidences $\boldsymbol{\omega}$, minimizing the objective $\mathcal{L}(\boldsymbol{\omega}) = \mathbb{E}_{x\sim\mathcal{D}}\, D_{KL}\big(P(a) \,\|\, P(a|m_1,\dots,m_K)\big)$ with a primal–dual approach that updates $\boldsymbol{\omega}$ by SGD and the dual variables by Exponentiated Gradient. In cluttered kitchen scenarios, the extended IOP (EIOP) fusion with learned confidences outperforms IOP and LogOP: it improves accuracy, reduces uncertainty (measured as entropy), and achieves high success rates (e.g., 97.33%). The approach targets robust, adaptive HRI in real-world environments and provides a pathway for extending to additional modalities and online control strategies. Overall, BMCLOP offers a principled, data-efficient way to tailor multimodal fusion to the current interaction conditions, supporting natural and reliable robot assistance.
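The two update rules named above can be sketched generically as follows. This is a minimal illustration, not the paper's algorithm: the function names, the step sizes, and the assumption that the dual variables live on a probability simplex are all hypothetical, since the summary does not state the constraint set or the Lagrangian.

```python
import numpy as np

def sgd_step(omega, grad_omega, lr=0.1):
    """Plain gradient-descent step on the primal variables (confidences)."""
    return np.asarray(omega, dtype=float) - lr * np.asarray(grad_omega, dtype=float)

def exponentiated_gradient_step(lam, grad_lam, lr=0.1):
    """Exponentiated-gradient ascent step on dual variables.

    Assumes (hypothetically) that `lam` is strictly positive and sums to 1.
    The multiplicative update keeps it on the simplex after renormalization.
    """
    lam = np.asarray(lam, dtype=float)
    logits = np.log(lam) + lr * np.asarray(grad_lam, dtype=float)
    logits -= logits.max()          # shift for numerical stability
    new_lam = np.exp(logits)
    return new_lam / new_lam.sum()  # renormalize onto the simplex

# Toy usage with made-up gradients
omega = sgd_step([1.0, 1.0, 1.0], [0.2, -0.1, 0.0])
lam = exponentiated_gradient_step([0.5, 0.5], [1.0, 0.0])
```

The multiplicative form is what distinguishes Exponentiated Gradient from an additive step: entries with larger gradient grow in proportion to their current value, and no projection is needed to stay positive.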

Abstract

The rapid development of collaborative robotics has opened new possibilities for helping elderly people who have difficulties in daily life, allowing robots to act according to specific intentions. However, efficient human-robot cooperation requires natural, accurate, and reliable intention recognition in shared environments. The paramount challenge is to reduce the uncertainty of the fused multimodal intention and to adaptively infer a more reliable result under the current interactive condition. In this work we propose a novel learning-based multimodal fusion framework, Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP). Our approach combines a Bayesian multimodal fusion method with a batch confidence learning algorithm to improve accuracy, uncertainty reduction, and success rate given the interactive condition. The multimodal intention recognition framework is generic and practical, and can easily be extended further. Our target assistive scenarios consider three modalities: gestures, speech, and gaze, each of which produces a categorical distribution over the finite set of intentions. The proposed method is validated with a six-DoF robot through extensive experiments and exhibits high performance compared to baselines.
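To make the fusion step concrete, here is a minimal sketch of a weighted logarithmic opinion pool over three categorical distributions, together with the entropy used to quantify uncertainty. It uses the textbook log-linear pooling rule, not the paper's EIOP formula, which this summary does not reproduce. The function names, the example probabilities, and the weights are illustrative assumptions.

```python
import numpy as np

def log_opinion_pool(dists, weights):
    """Weighted log-linear pool: fused(a) is proportional to prod_k P_k(a) ** w_k."""
    dists = np.asarray(dists, dtype=float)      # shape (K modalities, N intentions)
    weights = np.asarray(weights, dtype=float)  # shape (K,)
    log_fused = weights @ np.log(dists + 1e-12) # small epsilon guards against log(0)
    log_fused -= log_fused.max()                # shift for numerical stability
    fused = np.exp(log_fused)
    return fused / fused.sum()

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

# Made-up distributions over three candidate objects
speech = [0.4, 0.4, 0.2]
gesture = [0.3, 0.5, 0.2]
gaze = [0.1, 0.8, 0.1]

fused = log_opinion_pool([speech, gesture, gaze], weights=[1.0, 1.0, 1.0])
print(fused, entropy(fused), entropy(gaze))
```

With all weights equal to 1 the rule reduces to a normalized product of the modality distributions. Lowering a modality's weight flattens its contribution, which is the kind of per-modality confidence adjustment the framework learns from data.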

Paper Structure (15 sections, 2 theorems, 16 equations, 5 figures, 1 table, 1 algorithm)

Introduction
Proposed Method
Opinion Pool
Batch Confidence Learning with Constraints
Multimodal Interaction and Learning Procedure
Considered Modalities
Speech
Gestures
Gaze
Experiments And Evaluations
Experimental Setup
Multi-Scenarios Evaluation for Multimodal Fusion
Confidence Learning Analysis
Online Human-Robot Interaction and Ablation Study
Conclusion

Key Result

Proposition 1

Let $\bm{\Omega}$ be a convex set of confidences. Then there is a unique fused distribution $P(\bm{a}|m_1,\dots,m_K)$ that minimizes the loss $\mathcal{L}(\bm{\omega})$.

Figures (5)

Figure 1: Block diagram of the proposed multimodal fusion framework BMCLOP. When the interactive condition changes, constrained batch learning is used to learn the confidences from HRI experiences, for which deterministic feedback advice is additionally provided as ground truth. Object intention is our main focus. $Gestures$: The dashed lines in front of the Leap Motion divide the area into five direction intervals. The bring and place gestures implicitly specify the direction. $Gaze$: We obtain the gaze point on the detected surfaces from Pupil Core and use a Gaussian distribution to model the distribution over gazed objects. $Confidence~Constraints$: We visualize the original and experimental confidence spaces $\bm{\Omega}$ of IOP, LogOP, and EIOP, respectively.
Figure 2: Two robotic kitchen scenarios. We want to prepare some fruit salad, but the objects are all hard to reach. To recognize the human's intention, a microphone (A) and a Leap Motion (B) are placed in the workspace, and the user wears head-mounted Pupil glasses (C). The six or ten target objects on the table are the object intentions, with each direction colored differently. The touch screen is used for advice input and for visual feedback that keeps the HRI transparent.
Figure 3: Object entropy and confidence regulation of the considered modalities over the two kitchen scenarios. Each case involves bringing or replacing objects. Interactions are numbered in sequential order. After learning from a batch, EIOP outperforms all base modalities and the baselines IOP and LogOP.
Figure 4: An extreme case of multimodal fusion in the cluttered scenario (Scenario 2). The human intended to take object 5. Because speech and gesture provide only limited reference, both are highly uncertain: speech yields multiple candidates and gesture shows a large deviation, so only gaze gives the correct recognition result. IOP and LogOP fusion failed to strengthen the reliable base distribution, while EIOP recognized the intended interaction satisfactorily.
Figure 5: Learning curves of EIOP on the first interactive batch in Scenario 2. The confidence and the primal-dual gap are averaged over 30 runs with different random seeds. Shaded regions: 95% confidence interval.

Theorems & Definitions (2)

Proposition 1
Proposition 2
