M2R2: MultiModal Robotic Representation for Temporal Action Segmentation

Daniel Sliwowski, Dongheui Lee

TL;DR

M2R2 presents a model-agnostic, multimodal feature extractor for temporal action segmentation that fuses exteroceptive (vision, audio) and proprioceptive data into a common embedding, decoupling feature extraction from the TAS model. A BRPrompt-like pretraining strategy trains the Fusion Transformer to align window representations with action-order descriptions and to detect boundaries, using a joint loss that combines action-order alignment and boundary regression. On the REASSEMBLE dataset, M2R2 features enable state-of-the-art performance across strong TAS baselines, with substantial improvements over vision-only or proprioception-only approaches, and ablations confirm the complementary value of each modality. This work enables easier reuse of learned features across TAS architectures and suggests that integrating audio and rich proprioceptive cues with vision yields robust, fine-grained action segmentation in contact-rich robotic tasks.

Abstract

Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of multimodal temporal action segmentation (TAS) using the proposed M2R2 representation, applied to a contact-rich assembly and disassembly task.
  • Figure 2: M2R2 Model Architecture. To compute the multimodal feature at time instant $t_i$, we first process each modality separately to obtain image features $I_i$, audio features $A_i$, and proprioceptive features $\{S_i^s\}_{s=1}^{N_s}$, which are then fused by a Transformer encoder layer followed by an MLP. To obtain $I_i$, we use the ActionCLIP image encoder [ActionCLIP]. For $A_i$, we extract features using the Audio Spectrogram Transformer [gong2021ast]. For the proprioceptive data, we compute $S_i^s$ by first projecting the raw sensory data into a higher-dimensional space with a learnable projection matrix $W^s_p$. Next, we embed temporal information by element-wise multiplication with a learnable temporal embedding matrix $E^s_t$. Finally, we average over the temporal dimension to obtain $S_i^s$.
  • Figure 3: Temporal Fusion and Pretraining. Given a window $[i_b - p, i_e + p)$, we sample $N_w$ frames and extract features using our M2R2 feature extractor. A Temporal Fusion Transformer refines these features into $\widehat{X}$, which we average to obtain the window representation $E_w$. To learn action order, we minimize the distance between $E_w$ and a textual embedding $E_s$ generated from action labels by using a template. To enhance boundary detection, we minimize the MSE between the smoothed ground truth boundaries $\hat{B}_{gt}$ (obtained from the frame-wise dataset annotations) and predictions $B_{pred}$, obtained via a Boundary Regression Network.
  • Figure 4: Qualitative evaluation of different baseline TAS models. AWE [shi2023waypointbased] performs poorly in sections of highly nonlinear movement, while BOCPD [BOCPD] tends to over-segment in areas with high variation in force interactions (marked in black). The deep learning-based approaches occasionally misidentify objects.
  • Figure 5: Coarse-level predictions compared with fine-grained predictions. Same recording as in Figure 4.
  • ...and 1 more figure
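
The proprioceptive branch described in the Figure 2 caption (linear projection, element-wise temporal embedding, temporal average) can be sketched in a few lines of NumPy. This is only an illustration of those three steps, not the authors' implementation: the function name `proprio_feature`, the array shapes, and the random stand-ins for the learnable matrices $W^s_p$ and $E^s_t$ are assumptions made for the example.

```python
import numpy as np


def proprio_feature(raw, W_p, E_t):
    """Embed one proprioceptive stream s for a single time instant t_i.

    raw : (T, d_in)       raw sensor samples around the time instant
    W_p : (d_in, d_model) stand-in for the learnable projection matrix W^s_p
    E_t : (T, d_model)    stand-in for the learnable temporal embedding E^s_t
    Shapes and names are assumptions for illustration only.
    """
    projected = raw @ W_p          # project into a higher-dimensional space
    modulated = projected * E_t    # element-wise temporal embedding
    return modulated.mean(axis=0)  # average over the temporal dimension


# Hypothetical sizes: 8 samples of a 6-dimensional signal, 16-d embedding.
rng = np.random.default_rng(0)
T, d_in, d_model = 8, 6, 16
raw = rng.normal(size=(T, d_in))
W_p = rng.normal(size=(d_in, d_model))
E_t = rng.normal(size=(T, d_model))

S = proprio_feature(raw, W_p, E_t)
print(S.shape)
```

In a trained model, `W_p` and `E_t` would be parameters learned per sensor stream; here they are random only so the sketch runs on its own.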
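
The pretraining objective described in the Figure 3 caption combines an alignment term between the window representation $E_w$ and the text embedding $E_s$ with an MSE term between smoothed ground-truth boundaries $\hat{B}_{gt}$ and predictions $B_{pred}$. The sketch below shows one possible shape of such a joint loss. The caption does not specify the distance, the smoothing kernel, or the weighting, so the cosine distance, the Gaussian smoothing, and the weight `lam` are all assumptions, as are the function names.

```python
import numpy as np


def smooth_boundaries(binary, sigma=1.0):
    """Gaussian-smooth a 0/1 boundary indicator vector (assumed smoothing)."""
    idx = np.arange(len(binary))
    centers = np.flatnonzero(binary)
    if centers.size == 0:
        return np.zeros(len(binary))
    # Each frame takes the Gaussian response of its nearest boundary.
    d = idx[:, None] - centers[None, :]
    return np.exp(-0.5 * (d / sigma) ** 2).max(axis=1)


def joint_loss(X_hat, E_s, B_pred, B_gt_smooth, lam=1.0):
    """Alignment term plus weighted boundary-regression term.

    X_hat       : (N_w, d) refined frame features of one window
    E_s         : (d,)     text embedding of the action-order description
    B_pred      : (N_w,)   predicted boundary scores
    B_gt_smooth : (N_w,)   smoothed ground-truth boundaries
    lam         : hypothetical weight between the two terms
    """
    E_w = X_hat.mean(axis=0)  # window representation
    cos = E_w @ E_s / (np.linalg.norm(E_w) * np.linalg.norm(E_s) + 1e-8)
    align = 1.0 - cos  # cosine distance (an assumption)
    boundary = np.mean((B_pred - B_gt_smooth) ** 2)  # MSE
    return align + lam * boundary


# Toy window of 5 frames with a single boundary at frame 2.
B_gt = smooth_boundaries(np.array([0, 0, 1, 0, 0]))
E_s = np.array([1.0, 0.0, 0.0])
X_hat = np.tile(E_s, (5, 1))
loss = joint_loss(X_hat, E_s, B_gt, B_gt)
print(loss)
```

With perfectly aligned features and exact boundary predictions, as in the toy call, both terms vanish up to numerical tolerance; any mismatch in either term raises the loss.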