OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation

Yongxu Wang, Weiyun Yi, Xinhao Kong, Wanting Li

TL;DR

The paper tackles the challenge of covariate shift in imitation learning for humanoid robot manipulation in unstructured environments. It introduces OminiAdapt, a multimodal framework combining cross-view feature fusion with CBAM-based attention, continuous object tracking for background masking, and Dynamic Adaptive Batch Normalization to rapidly adapt to new tasks. Empirical results across clothes folding, apple picking, flower arrangement, and water pouring show notable improvements over baselines HIT and ACT, with ablations confirming the importance of masking strategies, attention modules, and partial BN freezing. The approach offers a scalable, environment-aware path toward robust, autonomous manipulation, though limitations remain in multi-perspective consistency and tactile modality integration.

Abstract

With the rapid development of embodied intelligence, leveraging large-scale human data for high-level imitation learning on humanoid robots has become a focal point of interest in both academia and industry. However, applying humanoid robots to precision operation domains remains challenging due to the complexities they face in perception and control processes, the long-standing physical differences in morphology and actuation mechanisms between humanoid robots and humans, and the lack of task-relevant features obtained from egocentric vision. To address the issue of covariate shift in imitation learning, this paper proposes an imitation learning algorithm tailored for humanoid robots. By focusing on the primary task objectives, filtering out background information, and incorporating channel feature fusion with spatial attention mechanisms, the proposed algorithm suppresses environmental disturbances and utilizes a dynamic weight update strategy to significantly improve the success rate of humanoid robots in accomplishing target tasks. Experimental results demonstrate that the proposed method exhibits robustness and scalability across various typical task scenarios, providing new ideas and approaches for autonomous learning and control in humanoid robots. The project will be open-sourced on GitHub.

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Results of our method in different tasks.
  • Figure 2: OminiAdapt Overview. The first frames from $N$ viewpoints are processed by the task interpreter module, which is based on a VLM and GroundingSAM, to generate a separate initial query frame for each view; this query frame initializes the tracking algorithm. Subsequently, the RGB video streams from all viewpoints undergo continuous object tracking, where key elements are masked to filter out task-irrelevant background information. The image features are then extracted using a semi-frozen backbone with dynamic adaptive batch normalization (BN) layers and enhanced through a channel-spatial attention module. The enhanced features, along with the robot's embedded proprioceptive states, are fed into the decoder of a Transformer architecture to predict the robot's actions over the next $T$ steps, with trajectory smoothing applied for improved motion consistency.
  • Figure 3: Task Interpreter Module
  • Figure 4: This figure presents our robot's hardware configuration. The upper right shows the complete system featuring dual Realman RM75-6F robotic arms equipped with an Inspire RH56DFX dexterous hand and a LinkerHand L10 dexterous hand. The perception system comprises three Intel RealSense D435i cameras mounted on a Yunji Water2 mobile platform.
  • Figure 5: Explanation of indicators for determining whether a task is successful or not.
  • ...and 1 more figures
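
The background-masking step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that the tracker yields one boolean mask per view (True on task-relevant pixels) and that masked-out background is replaced by a constant fill color; the summary does not specify either detail. The names `mask_background` and `mask_views` are hypothetical and are not taken from the paper's code.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # one RGB pixel
Image = List[List[Pixel]]      # rows of pixels (H x W)
Mask = List[List[bool]]        # True where the tracked object is (assumed convention)


def mask_background(frame: Image, mask: Mask, fill: Pixel = (0, 0, 0)) -> Image:
    """Keep pixels where the mask is True and replace the rest with `fill`."""
    same_height = len(frame) == len(mask)
    same_width = all(len(row) == len(m_row) for row, m_row in zip(frame, mask))
    if not (same_height and same_width):
        raise ValueError("frame and mask must have the same height and width")
    return [
        [pixel if keep else fill for pixel, keep in zip(row, m_row)]
        for row, m_row in zip(frame, mask)
    ]


def mask_views(frames: List[Image], masks: List[Mask],
               fill: Pixel = (0, 0, 0)) -> List[Image]:
    """Apply one mask per viewpoint to the current frame of each of the N views."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per view")
    return [mask_background(f, m, fill) for f, m in zip(frames, masks)]


# Toy 2x2 frame: only the top-left and bottom-right pixels belong to the object.
frame = [[(10, 20, 30), (40, 50, 60)],
         [(70, 80, 90), (100, 110, 120)]]
mask = [[True, False],
        [False, True]]
masked = mask_background(frame, mask)
```

In a real pipeline the per-view masks would come from the tracking algorithm on every frame, and the masked frames would then be passed to the backbone; a production version would normally operate on array types rather than nested lists.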