NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

Ziyue Zhu, Shangyang Wu, Shuai Zhao, Zhiqiu Zhao, Shengjie Li, Yi Wang, Fang Li, Haoran Luo

TL;DR

Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency and expanded exploration space.

Abstract

Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency, and an expanded exploration space. Our code is available.

Paper Structure (42 sections, 3 theorems, 39 equations, 17 figures, 1 table)

This paper contains 42 sections, 3 theorems, 39 equations, 17 figures, 1 table.

Key Result

Proposition 4.1

The plan-constrained update makes $m_t$ monotone with step size $\le 1$, stabilizing segmentation and reducing flicker; deterministic tie-breaking yields a unique segmentation under repeats.
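
As a rough illustration of what a monotone pointer with step size at most 1 looks like, the sketch below tracks a plan index $m_t$ over a stream of per-step primitive predictions. The update rule, the function name `plan_constrained_pointer`, and the "prefer staying" tie-break are all assumptions made for this example; the summary above does not give the paper's actual update, so this should not be read as the authors' method.

```python
from typing import List, Sequence


def plan_constrained_pointer(plan: Sequence[str], predictions: Sequence[str]) -> List[int]:
    """Return a plan index m_t per timestep (hypothetical update rule).

    At each step the pointer may only stay at m or advance to m + 1,
    so the trace is monotone with increments of at most 1.
    """
    if not plan:
        raise ValueError("plan must contain at least one primitive")
    m = 0
    trace: List[int] = []
    for label in predictions:
        matches_current = label == plan[m]
        can_advance = m + 1 < len(plan) and label == plan[m + 1]
        # Assumed deterministic tie-break: if the label fits both the
        # current and the next plan entry, stay on the current one.
        if can_advance and not matches_current:
            m += 1
        trace.append(m)
    return trace


if __name__ == "__main__":
    # Noisy per-step labels that flicker between two primitives.
    print(plan_constrained_pointer(["pick", "place"], ["pick", "place", "pick", "place"]))
```

With the plan `["pick", "place"]` and the flickering labels shown, the pointer trace is `[0, 1, 1, 1]`: once the plan has advanced to `place`, a later `pick` prediction cannot pull the segmentation back, which is the kind of flicker suppression the proposition describes.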

Figures (17)

  • Figure 1: An example of the NS-VLA pipeline to execute instruction-conditioned manipulation by orchestrating symbolic primitives and sparse action chunks.
  • Figure 2: Success rate comparison under three training settings: (i) training on full demonstrations and testing on LIBERO, (ii) 1-shot training (one demonstration per task) and testing on LIBERO, and (iii) training on full demonstrations and testing on LIBERO-Plus. While most baselines achieve high SR with full-demo training, their performance drops sharply under low-data training and generalization tasks. In contrast, NS-VLA maintains a consistently high success rate with minimal performance degradation.
  • Figure 3: (a) Examples of converting LIBERO instruction clauses into primitives and (b) the primitive distribution.
  • Figure 4: Overview of the NS-VLA framework: an RL-optimized neuro-symbolic policy for robotic manipulation, where the agent hierarchically orchestrates visual grounding, symbolic primitive inference, and continuous action chunking.
  • Figure 5:
  • ...and 12 more figures

Theorems & Definitions (6)

  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.3
  • proof