DECO: Decoupled Multimodal Diffusion Transformer for Bimanual Dexterous Manipulation with a Plugin Tactile Adapter

Xukun Li, Yu Sun, Lei Zhang, Bosheng Huang, Yibo Peng, Yuan Meng, Haojun Jiang, Shaoxuan Xie, Guacai Yao, Alois Knoll, Zhenshan Bing, Xinlong Wang, Zhenguo Sun

TL;DR

DECO is a DiT-based policy that decouples multimodal conditioning and is accompanied by DECO-50, a bimanual dexterous manipulation dataset with tactile sensing, consisting of 4 scenarios and 28 sub-tasks.

Abstract

DECO is a DiT-based policy that decouples multimodal conditioning. Image and action tokens interact via joint self-attention, while proprioceptive states and optional conditions are injected through adaptive layer normalization. Tactile signals are injected via cross-attention, while a lightweight LoRA-based adapter is used to efficiently fine-tune the pretrained policy. DECO is also accompanied by DECO-50, a bimanual dexterous manipulation dataset with tactile sensing, consisting of 4 scenarios and 28 sub-tasks, covering more than 50 hours of data, approximately 5 million frames, and 8,000 successful trajectories.
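The decoupled conditioning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the parameter names (`w_ada`, `sq`, `cq`, and so on), the residual layout, and all dimensions are assumptions chosen for readability. It only shows the three separate injection paths: joint self-attention over image and action tokens, AdaLN modulation from the proprioceptive state, and an optional cross-attention path for tactile tokens.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learned affine).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, wq, wk, wv):
    # Single-head scaled dot-product attention (an assumption for brevity).
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init_params(d, d_state, d_tac, rng):
    # Hypothetical parameter set; names and shapes are illustrative only.
    w = lambda *shape: rng.normal(0.0, 0.1, size=shape)
    return {"w_ada": w(d_state, 2 * d),
            "sq": w(d, d), "sk": w(d, d), "sv": w(d, d),
            "cq": w(d, d), "ck": w(d_tac, d), "cv": w(d_tac, d)}

def decoupled_block(img, act, state, tactile, p):
    # 1) Image and action tokens share one sequence for joint self-attention.
    x = np.concatenate([img, act], axis=0)
    # 2) AdaLN: the state vector is projected to a per-feature scale and shift.
    scale, shift = np.split(state @ p["w_ada"], 2)
    h = layer_norm(x) * (1.0 + scale) + shift
    x = x + attention(h, h, p["sq"], p["sk"], p["sv"])
    # 3) Optional tactile path: tokens query tactile features via cross-attention.
    if tactile is not None:
        x = x + attention(layer_norm(x), tactile, p["cq"], p["ck"], p["cv"])
    return x
```

Because the tactile path is a separate residual branch, passing `tactile=None` leaves the vision-action computation untouched, which is the property that lets a tactile module be attached after the base policy has been trained.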

Paper Structure (20 sections, 4 equations, 8 figures, 14 tables)

Figures (8)

  • Figure 1: Overview of the Proposed DECO Framework. DECO is a DiT-based policy that decouples multimodal conditioning. Image and action tokens interact via joint self-attention, while proprioceptive states and optional conditions are injected through adaptive layer normalization. Tactile signals are injected via cross-attention, while a lightweight LoRA-based adapter is used to efficiently fine-tune the pretrained policy. DECO is also accompanied by DECO-50, a bimanual dexterous manipulation dataset with tactile sensing, consisting of 4 scenarios and 28 sub-tasks, covering more than 50 hours of data, approximately 5 million frames, and 8,000 successful trajectories.
  • Figure 2: Two-Stage Training Paradigm for DECO. In the first stage, a vision–action policy is trained with images, proprioceptive states, and task-level conditions. In the second stage, the pretrained policy is frozen, and tactile signals are incorporated via a lightweight adapter and cross-attention, enabling parameter-efficient adaptation to tactile-aware manipulation without retraining the entire model.
  • Figure 3: Multimodal Diffusion Transformer Block with Decoupled Conditioning. Images are integrated via self-attention, proprioceptive states via AdaLN, and tactile signals via cross-attention, so that each modality can be integrated independently and efficiently.
  • Figure 4: Plugin Tactile Adapter. Raw tactile information is encoded by the tactile encoder and integrated into the pretrained policy via LoRA for efficient adaptation.
  • Figure 5: Task Illustration. DECO-50 dataset comprises four scenarios, each consisting of multiple sub-tasks.
  • ...and 3 more figures
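
The second-stage idea in Figures 2 and 4 (freeze the pretrained policy, train only a small adapter) can be sketched with a generic LoRA layer. This is a standard low-rank update and not necessarily the exact adapter used in DECO: the class name `LoRALinear`, the rank, the `alpha` scaling, and the choice of which layers receive the update are assumptions, since the captions do not specify them.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                      # pretrained weights, kept frozen
        self.a = rng.normal(0.0, 0.01, size=(d_in, rank))  # trainable down-projection
        self.b = np.zeros((rank, d_out))          # trainable up-projection, zero-initialized
        self.scaling = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction.
        return x @ self.weight + self.scaling * (x @ self.a) @ self.b

    def trainable_params(self):
        # Only the two low-rank factors would be updated in the second stage.
        return self.a.size + self.b.size
```

With the up-projection initialized to zero, the adapted layer initially reproduces the frozen layer's output, so fine-tuning starts from the pretrained behavior. The trainable parameter count is `rank * (d_in + d_out)`, which stays small relative to `d_in * d_out` when the rank is low.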