Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Hong Lu, Hengxu Li, Prithviraj Singh Shahani, Stephanie Herbers, Matthias Scheutz

TL;DR

The paper addresses reliability and interpretability gaps in vision-language-action robotics by probing the OpenVLA model to extract symbolic state representations. It trains linear probes on all 33 hidden layers of the Llama-2 7B backbone to predict object-state atoms (224) and action-state atoms (12), and integrates the predicted predicates into the DIARC cognitive architecture. A real-time DIARC-OpenVLA integration with a WebSocket pipeline and a React UI enables symbolic state monitoring during 10 LIBERO-spatial pick-and-place tasks. Results show consistently high probe accuracies (> 0.90) across most layers, though the expected early encoding of object states relative to action states was not observed, underscoring the need for more diverse data. The work lays a foundation for interpretable, robust robotic manipulation by combining cognitive architectures with VLA models.

Abstract

Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (> 0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.
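
To make the probing idea concrete, the following is a minimal sketch of a linear probe fitted on one layer's activations. It assumes each symbolic atom is treated as an independent binary label and that the probe is a plain logistic regression trained by gradient descent; the page does not specify the authors' training setup, so these choices, the function names, and the synthetic data are illustrative assumptions only. Real activations would come from OpenVLA's Llama backbone rather than from `make_toy_data`.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n=80, dim=4, seed=0):
    """Synthetic stand-in (assumption) for one layer's activations and atom labels."""
    rng = random.Random(seed)
    activations, labels = [], []
    while len(activations) < n:
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if abs(x[0]) < 0.2 or abs(x[1]) < 0.2:
            continue  # keep a margin so the two toy atoms are cleanly separable
        activations.append(x)
        # Two hypothetical atoms, each a linear function of the activations.
        labels.append([int(x[0] > 0), int(x[1] > 0)])
    return activations, labels


def train_linear_probe(activations, labels, epochs=200, lr=0.5):
    """Fit one logistic-regression probe per atom with full-batch gradient descent."""
    n, dim, n_atoms = len(activations), len(activations[0]), len(labels[0])
    weights = [[0.0] * dim for _ in range(n_atoms)]
    biases = [0.0] * n_atoms
    for _ in range(epochs):
        for k in range(n_atoms):
            grad_w = [0.0] * dim
            grad_b = 0.0
            for x, y in zip(activations, labels):
                logit = sum(w * xi for w, xi in zip(weights[k], x)) + biases[k]
                err = sigmoid(logit) - y[k]  # gradient of the log loss w.r.t. the logit
                for j in range(dim):
                    grad_w[j] += err * x[j]
                grad_b += err
            weights[k] = [w - lr * g / n for w, g in zip(weights[k], grad_w)]
            biases[k] -= lr * grad_b / n
    return weights, biases


def predict_atoms(weights, biases, x):
    """Return a 0/1 prediction per atom for a single activation vector."""
    return [
        int(sum(w * xi for w, xi in zip(wk, x)) + bk > 0.0)
        for wk, bk in zip(weights, biases)
    ]


def probe_accuracy(weights, biases, activations, labels):
    """Fraction of (timestep, atom) pairs that the probe predicts correctly."""
    correct = total = 0
    for x, y in zip(activations, labels):
        pred = predict_atoms(weights, biases, x)
        correct += sum(int(p == t) for p, t in zip(pred, y))
        total += len(y)
    return correct / total


if __name__ == "__main__":
    acts, labs = make_toy_data()
    w, b = train_linear_probe(acts, labs)
    print("training accuracy:", probe_accuracy(w, b, acts, labs))
```

Repeating this fit once per hidden layer and comparing held-out accuracies is one way to ask which layer encodes a given atom best. A real experiment would score the probe on trajectories not used for training, which this toy example omits.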

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: The DIARC-VLA-Probe System. The user selects a natural language command in DIARC's Graphical User Interface (GUI). The VLAComponent in DIARC sends this command to OpenVLA. The probes receive two hidden layers' activations in OpenVLA's Llama backbone that encode the most object state and action state information respectively. The two best hidden layers are identified through the probing experiment described in the probing experiment section. The probes predict the object state and the action state based on the hidden layers' activations at each timestep. The VLAComponent in DIARC updates DIARC's beliefs based on the predicted object state and action state.
  • Figure 2: DIARC-OpenVLA GUI. The left-hand pane displays the real-time camera feed (updated at 5-10 Hz), showing the robot’s manipulation progress. The right-hand pane color-codes each predicted symbolic state (green for newly activated, red for deactivated), letting users quickly verify whether OpenVLA’s internal representation matches the environment. After task completion, a timeline slider appears, allowing the user to revisit earlier steps’ images and states for deeper analysis.
  • Figure 3: Example Labeled Object States and Action States in a Pick-and-Place Trajectory. Object states are shown in green and action states are shown in blue.
  • Figure 4: Probing Results. The first seven columns are object state symbols and the last two columns are action state symbols.
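
The color coding described for Figure 2 amounts to a set difference between the symbolic states predicted at consecutive timesteps. The sketch below illustrates that idea only; the function names and the predicate strings are hypothetical and are not taken from DIARC, OpenVLA, or the paper's UI code.

```python
def diff_symbolic_states(previous, current):
    """Return (activated, deactivated) atoms between two consecutive timesteps.

    Both arguments are sets of predicate strings that the probes marked as true.
    """
    activated = sorted(current - previous)    # true now, false before
    deactivated = sorted(previous - current)  # true before, false now
    return activated, deactivated


def color_code(previous, current):
    """Map each changed atom to a display color, following the Figure 2 legend."""
    activated, deactivated = diff_symbolic_states(previous, current)
    colors = {atom: "green" for atom in activated}
    colors.update({atom: "red" for atom in deactivated})
    return colors


if __name__ == "__main__":
    # Hypothetical predicates for a pick-and-place step.
    before = {"on(bowl,table)", "open(gripper)"}
    after = {"on(bowl,table)", "grasped(bowl)"}
    print(color_code(before, after))
```

Atoms that stay unchanged between timesteps are left out of the returned mapping, so a UI would render them in its default style.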