Learning Multimodal Confidence for Intention Recognition in Human-Robot Interaction
Xiyuan Zhao, Huijun Li, Tianyuan Miao, Xianyi Zhu, Zhikai Wei, Aiguo Song
TL;DR
The paper addresses uncertainty in multimodal intention recognition for human-robot interaction (HRI) by introducing Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP), a constrained, learning-based fusion framework. It combines Bayesian opinion-pool fusion with batch learning of per-modality confidences $\boldsymbol{\omega}$, formulating the objective as $\mathcal{L}(\boldsymbol{\omega}) = \mathbb{E}_{x\sim\mathcal{D}} D_{KL}(P(a) \;||\; P(a|m_1,\dots,m_K))$ and solving it with a primal-dual scheme that updates $\boldsymbol{\omega}$ by SGD and the dual variables by Exponentiated Gradient. Experiments in cluttered kitchen scenarios show that the extended Independent Opinion Pool (EIOP) with learned confidences outperforms the Independent Opinion Pool (IOP) and the Logarithmic Opinion Pool (LogOP): it improves accuracy, reduces uncertainty (measured as entropy), and achieves high success rates (e.g., 97.33%). The approach supports robust, adaptive HRI in real-world environments and can be extended to additional modalities and online control strategies. Overall, BMCLOP offers a principled, data-efficient way to tailor multimodal fusion to the current interaction conditions.
Abstract
The rapid development of collaborative robotics has opened new possibilities for helping elderly people who have difficulties in daily life, allowing robots to act according to specific intentions. However, efficient human-robot cooperation requires natural, accurate and reliable intention recognition in shared environments. The main challenge at present is to reduce the uncertainty of the fused multimodal intention and to infer a more reliable result adaptively, whatever the current interaction conditions. In this work we propose a novel learning-based multimodal fusion framework, Batch Multimodal Confidence Learning for Opinion Pool (BMCLOP). Our approach combines a Bayesian multimodal fusion method with a batch confidence learning algorithm to improve accuracy, uncertainty reduction and success rate under the given interaction conditions. The resulting intention recognition framework is generic and practical, and can easily be extended further. Our target assistive scenarios consider three modalities: gestures, speech and gaze, each of which produces a categorical distribution over a finite set of intentions. The proposed method is validated with a six-DoF robot through extensive experiments and exhibits high performance compared to baselines.
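
To make the fusion setting concrete, the following is a minimal sketch of how three categorical distributions over the same intentions might be fused with per-modality confidence weights. It uses a generic confidence-weighted product pool, $P(a) \propto \prod_k P_k(a)^{\omega_k}$, which is an assumption made for illustration and not necessarily the exact EIOP rule of the paper. The function names, the example distributions and the weight values are all hypothetical, and the weights are fixed by hand here rather than learned.

```python
import math

def weighted_product_pool(distributions, weights):
    """Fuse categorical distributions as p(a) proportional to prod_k p_k(a) ** w_k.

    Illustrative assumption: a generic confidence-weighted product pool,
    not the paper's exact fusion rule.
    """
    n_intentions = len(distributions[0])
    log_scores = []
    for a in range(n_intentions):
        # Work in log space; clamp to avoid log(0).
        score = sum(w * math.log(max(p[a], 1e-12))
                    for p, w in zip(distributions, weights))
        log_scores.append(score)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Hypothetical per-modality beliefs over three intentions.
gesture = [0.6, 0.3, 0.1]
speech = [0.5, 0.4, 0.1]
gaze = [0.2, 0.3, 0.5]   # disagrees with the other two modalities

modalities = [gesture, speech, gaze]
equal_conf = weighted_product_pool(modalities, [1.0, 1.0, 1.0])
low_gaze_conf = weighted_product_pool(modalities, [1.0, 1.0, 0.2])

print("equal confidence:", equal_conf, "entropy:", entropy(equal_conf))
print("gaze down-weighted:", low_gaze_conf, "entropy:", entropy(low_gaze_conf))
```

Lowering the confidence of a modality that disagrees with the others shifts the fused distribution toward the remaining modalities. In BMCLOP these confidences are learned from batches of interaction data rather than set manually.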
