Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Hong Lu; Hengxu Li; Prithviraj Singh Shahani; Stephanie Herbers; Matthias Scheutz

TL;DR

The paper addresses reliability and interpretability gaps in vision-language-action (VLA) robotics by probing the OpenVLA model for symbolic state representations. It trains linear probes on all 33 hidden layers of the Llama-2 7B backbone to predict 224 object-state atoms and 12 action-state atoms, and integrates the predicted predicates into the DIARC cognitive architecture (CA). A real-time DIARC-OpenVLA integration, built on a WebSocket pipeline and a React UI, enables symbolic state monitoring during 10 LIBERO-spatial pick-and-place tasks. Probe accuracies are consistently high (> 0.90) across most layers, but the expected earlier encoding of object states relative to action states was not observed, which underscores the need for more diverse data. The work lays a foundation for interpretable, robust robotic manipulation by combining a CA with a VLA model.

Abstract

Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (> 0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.
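
To make the layer-wise probing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a probe is a multi-label logistic regression fitted independently on each layer's hidden states, and it uses small synthetic activations in place of real OpenVLA features (the paper probes 33 layers for 224 object-state and 12 action-state atoms; the sizes here are deliberately tiny). All function names, dimensions, and hyperparameters are hypothetical.

```python
import numpy as np


def train_linear_probe(X, Y, lr=0.05, epochs=200):
    """Fit a multi-label logistic-regression probe: sigmoid(X @ W + b) ~ Y."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # predicted atom probabilities
        G = (P - Y) / n                          # gradient of mean BCE w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def probe_accuracy(X, Y, W, b):
    """Fraction of (sample, atom) pairs whose truth value is predicted correctly."""
    pred = (X @ W + b) > 0
    return float((pred == (Y > 0.5)).mean())


def make_synthetic_layers(n_layers=4, n=400, d=32, n_atoms=6, seed=0):
    """Stand-in for hidden states (assumption): each layer linearly encodes the atoms."""
    rng = np.random.default_rng(seed)
    Y = (rng.random((n, n_atoms)) > 0.5).astype(float)  # binary atom labels
    layers = []
    for _ in range(n_layers):
        M = rng.normal(size=(n_atoms, d))                # random per-layer encoding
        layers.append((2 * Y - 1) @ M + 0.1 * rng.normal(size=(n, d)))
    return layers, Y


def probe_all_layers(layers, Y, train_frac=0.8):
    """Train one probe per layer and report held-out accuracy for each."""
    split = int(train_frac * len(Y))
    accs = []
    for X in layers:
        W, b = train_linear_probe(X[:split], Y[:split])
        accs.append(probe_accuracy(X[split:], Y[split:], W, b))
    return accs


if __name__ == "__main__":
    layers, Y = make_synthetic_layers()
    for i, acc in enumerate(probe_all_layers(layers, Y)):
        print(f"layer {i}: held-out probe accuracy = {acc:.3f}")
```

Comparing the per-layer accuracies returned by `probe_all_layers` is the kind of analysis that would show whether one group of atoms becomes linearly decodable at earlier layers than another. With real activations, each element of `layers` would instead hold the hidden states extracted from one layer of the backbone.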
