DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation

Xiongfeng Peng, Jiaqian Yu, Dingzhe Li, Yixiang Jin, Lu Xu, Yamin Mao, Chao Zhang, Weiming Li, Sujin Jang, Dongwook Lee, Daehyun Ji

TL;DR

DAM-VLA is a dynamic action model-based VLA framework with three components: an action routing mechanism, a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models.

Abstract

In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: DAM-VLA framework and experimental results. (a) We propose the DAM-VLA framework, which dynamically integrates the inherent reasoning capabilities of VLMs with specialized diffusion-based action models tailored for arm movement and gripper manipulation. In various robotic tasks, arm movement typically covers a larger spatial range than gripper manipulation. Consequently, in the observed images, arm-movement trajectories often occupy most of the image, while gripper manipulation is usually confined to a small, localized area. (b) Across extensive evaluations, DAM-VLA achieves higher average success rates than state-of-the-art VLA methods, with improvements on pick-and-place tasks in the SIMPLER simulation, long-horizon tasks in the FurnitureBench simulation, and real-world pick-and-place evaluations.
  • Figure 2: Using the task of placing a carrot on a plate as an illustrative example, we identify three distinctions between arm movement and gripper manipulation: Path Constraints, Visual Attention, and Dataset Representation.
  • Figure 3: The architecture of DAM-VLA. Given an RGB image observation and a task description, the model predicts a temporal sequence of actions. The process consists of three key components: 1) a vision-language model that encodes the observation into visual, class, and register tokens, integrates the visual tokens with a set of linguistic tokens, and produces the cognition and reasoning latents; 2) an action routing module that generates a weight and feeds it into the dynamic action model; 3) a dynamic action model that executes different action models by combining the low-level class token or register token from the vision model with the high-level cognition latent from the VLM to predict an action sequence.
  • Figure 4: Illustration of the dual-scale action weighting mechanism. The trajectory weight highlights critical manipulation phases via asymmetrical Gaussian distributions. Within each predicted chunk, the action chunk weight applies exponential decay to prioritize immediate temporal accuracy. The final weight integrates both scales to guide model supervision.
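
The dual-scale weighting described in the Figure 4 caption can be sketched as follows. This is a minimal illustration and not the paper's implementation. The caption gives only the general shape (an asymmetrical Gaussian over the trajectory, exponential decay within a chunk), so every function name and parameter here is hypothetical. The sketch also makes three assumptions the caption does not confirm: the two scales are combined by multiplication, the trajectory weight adds a Gaussian bump to a baseline of 1, and the result scales a per-step squared error.

```python
import math


def trajectory_weight(t, center, sigma_before, sigma_after, amplitude=1.0):
    """Asymmetric Gaussian bump around a critical timestep `center`.

    Different widths are used before and after the center, so the weight
    can rise and fall at different rates. The baseline of 1.0 is an
    assumption that keeps non-critical steps from being ignored.
    """
    sigma = sigma_before if t < center else sigma_after
    return 1.0 + amplitude * math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))


def chunk_weights(horizon, decay=0.5):
    """Exponentially decaying weights over the steps of one action chunk."""
    return [math.exp(-decay * k) for k in range(horizon)]


def dual_scale_weights(t, horizon, center, sigma_before, sigma_after, decay=0.5):
    """Weights for a chunk predicted at trajectory step `t`.

    Step k of the chunk targets trajectory step t + k. The product of the
    two scales is an assumed way of integrating them.
    """
    per_step = chunk_weights(horizon, decay)
    return [
        trajectory_weight(t + k, center, sigma_before, sigma_after) * per_step[k]
        for k in range(horizon)
    ]


def weighted_mse(pred, target, weights):
    """Weighted mean squared error over one chunk of scalar actions."""
    total = sum(w * (p - q) ** 2 for w, p, q in zip(weights, pred, target))
    return total / sum(weights)


if __name__ == "__main__":
    # Hypothetical example: a critical phase (e.g. a grasp) at step 10.
    weights = dual_scale_weights(t=8, horizon=4, center=10,
                                 sigma_before=2.0, sigma_after=4.0)
    print([round(w, 3) for w in weights])
```

In this sketch, `center` would mark a critical manipulation phase such as a gripper state change. How such phases are located in a trajectory is left open here.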