Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang; Shangru Li; Shuhan Wang; Xuanyang Xi; Dingkang Liang; Xiang Bai

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

Paper Structure (40 sections, 5 equations, 6 figures, 10 tables)

Introduction
DOMINO Dataset
Task Definition
Data Construction
Data Characteristics
Spatiotemporal Task Taxonomy
Hierarchical Dynamic Complexity
Comprehensive Evaluation Metrics
Dynamic-Aware VLA
Scene-Centric Spatiotemporal Dynamics Encoding
Object-Centric Dynamic Representation
Training Strategy
Experiment
Experimental Setup
Benchmarks
...and 25 more sections

Figures (6)

Figure 1: (a) Illustration of the defined dynamic difficulty levels, progressing from static (Level 0) to stochastic and abrupt dynamics (Level 3). (b) Dynamic awareness requires capturing historical context and anticipating future motion. (c) Performance of SOTA models degrades when shifting from static to dynamic environments.
Figure 2: Dataset visualization. We present the DOMINO dataset of 117,000 dynamic manipulation trajectories, covering 35 distinct tasks across five robot embodiments.
Figure 3: PUMA processes historical motion flows, current observations, and instructions to encode scene-centric historical dynamics. It employs a dual-query mechanism where Action Queries decode continuous action chunks and World Queries aggregate dynamic representations. During training, world queries are supervised via a similarity loss against future features extracted by DINO to predict object-centric dynamics.
Figure 4: Performance degradation of the ACT model across three levels of dynamic complexity.
Figure 5: PUMA performs significantly better than the baselines on difficult tasks.
...and 1 more figure
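
The Figure 3 caption states that world queries are supervised via a similarity loss against future features extracted by DINO, but the exact form of that loss is not given here. The following is a minimal sketch under the assumption that the loss is one minus cosine similarity, averaged over paired feature vectors. The function names `cosine_similarity` and `world_query_similarity_loss` are hypothetical, and plain Python lists stand in for the feature tensors a real model would use.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def world_query_similarity_loss(predicted, target):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    predicted: vectors produced by the world queries (illustrative stand-in).
    target: future-frame features from a frozen encoder such as DINO
            (assumed to be precomputed; no gradient flows through them).
    The cosine form is an assumption; the source only says "similarity loss".
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and equal in length")
    losses = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(losses) / len(losses)


# Toy example: the first pair is perfectly aligned, the second is orthogonal.
predicted = [[1.0, 0.0], [0.0, 2.0]]
target = [[2.0, 0.0], [1.0, 0.0]]
loss = world_query_similarity_loss(predicted, target)
```

In this toy case the aligned pair contributes a loss of 0 and the orthogonal pair a loss of 1, so the mean is 0.5. A loss of this shape depends only on feature direction, not magnitude, which is one common reason to prefer it when regressing onto features from a pretrained encoder.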
