Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

Ruoxuan Feng, Di Hu, Wenke Ma, Xuelong Li

TL;DR

The paper proposes MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage.

Abstract

Humans possess a remarkable talent for flexibly alternating between different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the utilization of different senses. In order to endow robots with a similar ability, we incorporate the task stages divided by sub-goals into the imitation learning process to accordingly guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.

Paper Structure (27 sections, 5 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: An illustration of our multi-sensory robot system and a visualization of Modality Temporality in a multi-stage task: pouring. We show confidence in action prediction (maximum softmax score) when using the inputs of all modalities and when selectively masking uni-modal features. Due to the changing importance of modalities, both evident inter-stage (coarse-grained) and minor intra-stage (fine-grained) changes in confidence are observed when masking uni-modal features. The low confidence fluctuations near stage boundaries also reflect insufficient task stage understanding.
  • Figure 2: The pipeline of our method MS-Bot. It consists of four parts: feature extractor, state tokenizer, stage comprehension module, and dynamic fusion module.
  • Figure 3: An illustration of the task setup for the peg insertion with keyway and pouring tasks.
  • Figure 4: Visualization of the aggregated attention scores for each modality and stage scores in the pouring task (Q2). At each timestep, we average the attention scores on all feature tokens of each modality separately. The stage score is the output of the gate network after softmax normalization.
  • Figure 5: Illustration of the pouring task. We randomly shift the fixed target cylinder sideways by $0\sim 3$ cm (indicated by the blue arrow) in training demonstrations, and by $0\sim 6$ cm (indicated by the orange arrow) during testing.
  • ...and 9 more figures
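
The quantities named in the captions of Figures 1 and 4 can be illustrated with a small sketch. This is not the authors' implementation: the function names, the token-to-modality layout, and all numbers below are hypothetical and chosen only to show the arithmetic. It computes (a) a per-modality aggregated attention score by averaging over that modality's feature tokens, (b) stage scores as a softmax over gate-network outputs, and (c) prediction confidence as the maximum softmax score over action logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_attention(attn_scores, token_modalities):
    """Average the attention scores of the tokens belonging to each modality."""
    sums, counts = {}, {}
    for score, modality in zip(attn_scores, token_modalities):
        sums[modality] = sums.get(modality, 0.0) + score
        counts[modality] = counts.get(modality, 0) + 1
    return {modality: sums[modality] / counts[modality] for modality in sums}


def prediction_confidence(action_logits):
    """Confidence as the maximum softmax score (as in the Figure 1 caption)."""
    return max(softmax(action_logits))


# Hypothetical single timestep: six feature tokens, two per modality.
token_modalities = ["vision", "vision", "audio", "audio", "touch", "touch"]
attn_scores = [0.05, 0.15, 0.30, 0.40, 0.04, 0.06]

# Assumed toy values for the gate-network outputs (one per stage)
# and for the policy's action logits.
gate_outputs = [2.0, 0.5, -1.0]
action_logits = [1.2, 0.3, -0.5]

per_modality = aggregate_attention(attn_scores, token_modalities)
stage_scores = softmax(gate_outputs)
confidence = prediction_confidence(action_logits)

print(per_modality)
print(stage_scores)
print(confidence)
```

In this toy timestep the audio tokens receive the largest average attention, which is the kind of per-modality reading that Figure 4 plots over time. Masking a modality, as in Figure 1, would amount to recomputing `confidence` from logits produced without that modality's features.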