Chain-of-Symbol Prompting Elicits Planning in Large Language Models
Hanxu Hu, Hongyuan Lu, Huajian Zhang, Yun-Ze Song, Wai Lam, Yue Zhang
TL;DR
This work investigates whether large language models plan better when spatial environments described in natural language are represented symbolically. It introduces Chain-of-Symbol (CoS) prompting, a training-free method that converts spatial descriptions into condensed symbolic representations during intermediate reasoning, and evaluates it on the Natala benchmark (Brick World, NLVR-based Manipulation, Natural Language Navigation) plus SPARTUN SQA. Across GPT-3.5-turbo and LLAMA-2, CoS consistently surpasses Chain-of-Thought prompting in accuracy while using fewer input tokens, with accuracy gains of up to 60.8 percentage points (from 31.8% to 92.6%) on Brick World and robust performance across languages and model sizes. The results suggest that symbolic representations can unlock emergent symbolic understanding in large models, enabling cheaper, more reliable spatial planning without additional training.
Abstract
In this paper, we take the initiative to investigate the performance of LLMs on complex planning tasks that require LLMs to understand a virtual spatial environment simulated via natural language and act correspondingly in text. We propose a benchmark named Natural Language Planning and Action (Natala) composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We found that current popular LLMs such as ChatGPT still lack abilities in complex planning. This raises a question: do LLMs have a good understanding of environments described in natural language, or might alternatives such as symbolic representations be neater and hence easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting) that represents complex environments with condensed symbolic spatial representations during the chained intermediate thinking steps. CoS is easy to use and does not need additional training of LLMs. Extensive experiments indicate that CoS clearly surpasses the performance of Chain-of-Thought (CoT) Prompting in all three planning tasks while using even fewer input tokens than CoT on ChatGPT and InstructGPT. The performance gain is strong, up to 60.8 percentage points in accuracy (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially reduces the number of tokens in the prompt, by up to 65.8% (from 407 to 139 tokens) for the intermediate steps in the Brick World demonstrations. Code and data are available at: https://github.com/hanxuhu/chain-of-symbol-planning
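To make the idea of a condensed symbolic representation concrete, the following is a minimal sketch of how a Brick World-style description might be compressed into a symbol chain and turned into a removal plan. The `/` notation, the (upper, lower) relation format, and the helper names `stack_order`, `to_symbols`, and `removal_plan` are assumptions made for illustration; they are not taken from the paper's prompts or released code.

```python
from typing import Dict, List, Tuple


def stack_order(on_top_of: List[Tuple[str, str]]) -> List[str]:
    """Return the bricks of a single stack from top to bottom.

    Each pair (upper, lower) encodes a sentence such as
    "Brick B is on top of brick A" (hypothetical input format).
    """
    below: Dict[str, str] = dict(on_top_of)  # upper brick -> brick beneath it
    lowers = set(below.values())
    # The top brick is the only one that never sits beneath another brick.
    tops = [upper for upper in below if upper not in lowers]
    if len(tops) != 1:
        raise ValueError("expected exactly one stack with a single top brick")
    order = [tops[0]]
    while order[-1] in below:
        nxt = below[order[-1]]
        if nxt in order:  # guard against cyclic (inconsistent) descriptions
            raise ValueError("cyclic description")
        order.append(nxt)
    return order


def to_symbols(order: List[str]) -> str:
    """Condense the stack into one symbolic line, top brick first."""
    return "/".join(order)


def removal_plan(order: List[str], target: str) -> List[str]:
    """Bricks to grab, in order, so that `target` can be taken last."""
    return order[: order.index(target) + 1]


# "B is on top of A. A is on top of D. D is on top of C."
relations = [("B", "A"), ("A", "D"), ("D", "C")]
order = stack_order(relations)
symbols = to_symbols(order)
plan = removal_plan(order, "D")
print(symbols)          # B/A/D/C
print(", ".join(plan))  # B, A, D
```

In a CoS-style demonstration, a short line such as `B/A/D/C` would stand in for the longer natural-language restatement of the stack that a CoT demonstration spells out, which is where the token savings come from.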
