AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning

Alan Dao, Dinh Bach Vu, Bui Quang Huy

TL;DR

AlphaSpace addresses the challenge of enabling 3D spatial reasoning in large language models for robotic manipulation by introducing a semantics-based tokenization scheme that encodes height and 3D coordinates, complemented by synthetic symbolic reasoning data. It trains a decoder-only model on a large, richly annotated synthetic dataset and reports substantial gains on the EmbodiedBench manipulation subtask: 66.67% total accuracy, compared with 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. Importantly, the approach avoids heavy reliance on vision-based encoders, offering a lightweight, structured alternative that improves generalization for tabletop manipulation tasks. This work highlights the value of structured spatial representations and symbolic supervision for embodied AI and motivates future extensions using reinforcement learning and hybrid sensing.

Abstract

This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific (x, y, z) coordinates. Experimental results suggest that AlphaSpace demonstrates promising potential for improving manipulation tasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. These results demonstrate the potential of structured spatial encoding for manipulation tasks and warrant further exploration.
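
To make the idea of coarse and fine-grained spatial tokens concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the grid resolution, the coarse/fine split, the token spellings (`<|coarse-…|>`, `<|fine-…|>`, `<|height-…|>`), and the names `SceneObject`, `encode_object`, and `decode_position` are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass

# All constants are illustrative assumptions, not values from the paper.
GRID_SIZE = 100                              # hypothetical cells per axis
COARSE_CELLS = 25                            # hypothetical coarse cells per axis
FINE_PER_COARSE = GRID_SIZE // COARSE_CELLS  # fine positions per coarse cell

# Hypothetical token layout: coarse cell, fine offset, then height level.
_POSITION = re.compile(
    r"<\|coarse-(\d+)-(\d+)\|><\|fine-(\d+)-(\d+)\|><\|height-(\d+)\|>"
)


@dataclass(frozen=True)
class SceneObject:
    color: str
    shape: str
    x: int  # grid column, 0 <= x < GRID_SIZE
    y: int  # grid row, 0 <= y < GRID_SIZE
    z: int  # discretized height level


def encode_object(obj: SceneObject) -> str:
    """Encode an object as attribute, coarse, fine, and height tokens."""
    if not (0 <= obj.x < GRID_SIZE and 0 <= obj.y < GRID_SIZE):
        raise ValueError("object lies outside the workspace grid")
    # Split each coordinate into a coarse cell index and an offset inside it.
    coarse_x, fine_x = divmod(obj.x, FINE_PER_COARSE)
    coarse_y, fine_y = divmod(obj.y, FINE_PER_COARSE)
    return (
        f"<|{obj.color}|><|{obj.shape}|>"
        f"<|coarse-{coarse_x}-{coarse_y}|>"
        f"<|fine-{fine_x}-{fine_y}|>"
        f"<|height-{obj.z}|>"
    )


def decode_position(tokens: str) -> tuple[int, int, int]:
    """Recover the (x, y, z) grid position from an encoded token string."""
    match = _POSITION.search(tokens)
    if match is None:
        raise ValueError("no position tokens found")
    cx, cy, fx, fy, z = map(int, match.groups())
    return cx * FINE_PER_COARSE + fx, cy * FINE_PER_COARSE + fy, z


black_cube = SceneObject(color="black", shape="cube", x=57, y=22, z=3)
encoded = encode_object(black_cube)
print(encoded)  # <|black|><|cube|><|coarse-14-5|><|fine-1-2|><|height-3|>
print(decode_position(encoded))  # (57, 22, 3)
```

In this sketch the coarse token narrows the location to a region of the workspace and the fine token pins down the cell inside that region, so a position can be read back exactly from the text alone, without an image encoder.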

Paper Structure

This paper contains 19 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Put black cube onto green cube
  • Figure 2: Performance Comparison on EmbodiedBench Manipulation Subtask