AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Yusuke Takagi, Motonari Kambara, Daichi Yashima, Koki Seno, Kento Tokura, Komei Sugiura

Abstract

In this study, we address language-guided robotic manipulation, in which a robot must manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots operating in human environments and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance on this task, deploying them in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model's lightweight, fast sequential state modeling processes visual and textual inputs, allowing the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points in task success rate while running inference approximately three times faster.

Paper Structure (23 sections, 6 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of AnoleVLA and real-world performance. (Left) Our deep SSM backbone processes language instructions and robot observations to generate trajectories, leveraging a two-stage training strategy with the acceleration loss for smooth control. (Right) Physical experiment results. The horizontal and vertical axes represent inference speed and the average success rate, respectively. AnoleVLA achieved the highest overall success rate. Notably, compared with $\pi_{0.5}$, AnoleVLA not only yielded superior task performance but also demonstrated an inference speed approximately three times faster.
  • Figure 2: Typical scene of the language-guided manipulation task. The frames illustrate the robot's execution in chronological order from left to right. In this scene, the robot is given the instruction, "Place the apple into the red bowl." The robot successfully grasps the apple from among multiple fruits and places it in the bowl.
  • Figure 3: Model architecture of AnoleVLA. Multimodal tokens (proprioception, state delta, vision, and language) are concatenated and processed by a Mamba backbone, and the final token predicts an $H$-step action chunk. The two-stage training supervises both velocities and their temporal differences to improve execution smoothness. In this figure, $\bm{s}^{(t)}$, $\Delta \bm{s}^{(t)}$, $\bm{x}^{(t)}_v$, and $\bm{x}_l$ represent the state, state delta, visual observation, and natural language instruction at time step $t$, respectively. On the right-hand side, 'pred.' and 'GT' denote the predicted outputs and the corresponding ground truth, respectively. Specifically, $\hat{\bm{y}}$ and $\bm{y}$ represent the predicted future actions and their ground truth, respectively. Furthermore, $\Delta \hat{\bm{y}}$ and $\Delta \bm{y}$ denote the temporal differences of $\hat{\bm{y}}$ and $\bm{y}$, respectively.
  • Figure 4: Qualitative results of AnoleVLA and a baseline method. For each example, we display the initial observation $\bm{x}_v^{(0)}$ and the chronological sequence of observations $\bm{x}_v^{(t)}$ during task execution. In each example, the robot executes the task based on the following instructions: (a) "Pick and place a puck onto a shelf." (b) "Hammer a screw on the wall." (c) "Push the puck to a goal."
  • Figure 5: (a) Robot platform and experimental environment used in the physical experiments. We used the Human Support Robot (HSR) [Yamamoto2019hsr] as the robot platform. (b) Everyday objects used in the experiments. We used the standard YCB objects [calli2015benchmarking] for manipulation research, along with additional objects to increase diversity in appearance and size.
  • ...and 1 more figure
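
The Figure 3 caption says training supervises both the predicted velocities and their temporal differences, and Figure 1 calls the latter an acceleration loss. The sketch below illustrates that idea for a single $H$-step action chunk. It is only an illustration under stated assumptions: the use of mean squared error, the weight `diff_weight`, and all function names are hypothetical, and the captions do not specify the paper's exact loss form or how the two training stages schedule the terms.

```python
from typing import List, Sequence

# An action chunk: H time steps, each with D action dimensions (velocities).
Chunk = Sequence[Sequence[float]]


def temporal_difference(chunk: Chunk) -> List[List[float]]:
    """First differences along time: d[h] = chunk[h + 1] - chunk[h]."""
    return [
        [nxt - cur for cur, nxt in zip(chunk[h], chunk[h + 1])]
        for h in range(len(chunk) - 1)
    ]


def mean_squared_error(pred: Chunk, target: Chunk) -> float:
    """Mean squared error over all entries of two equally shaped chunks."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def chunk_loss(pred: Chunk, target: Chunk, diff_weight: float = 1.0) -> float:
    """Velocity term plus a weighted temporal-difference (acceleration) term.

    Assumption: both terms use MSE and are combined with a scalar weight.
    `diff_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    velocity_term = mean_squared_error(pred, target)
    difference_term = mean_squared_error(
        temporal_difference(pred), temporal_difference(target)
    )
    return velocity_term + diff_weight * difference_term


# Ground truth: a steadily increasing 1-D velocity over H = 3 steps.
target = [[0.0], [1.0], [2.0]]
# Same per-step error magnitude (0.5), but different smoothness.
smooth_pred = [[0.5], [1.5], [2.5]]   # constant offset, same differences as target
jittery_pred = [[0.5], [0.5], [2.5]]  # error flips sign, differences deviate

smooth_loss = chunk_loss(smooth_pred, target)
jittery_loss = chunk_loss(jittery_pred, target)
```

Both predictions have the same velocity error, so a velocity-only loss could not tell them apart. The temporal-difference term penalizes only the jittery chunk, which is the sense in which this kind of supervision favors smoother execution.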