
Chain-of-Symbol Prompting Elicits Planning in Large Language Models

Hanxu Hu, Hongyuan Lu, Huajian Zhang, Yun-Ze Song, Wai Lam, Yue Zhang

TL;DR

This work investigates whether large language models plan better when spatial environments described in natural language are represented symbolically. It introduces Chain-of-Symbol (CoS) prompting, a training-free method that converts spatial descriptions into condensed symbolic representations during intermediate reasoning, and evaluates it on the Natala benchmark (Brick World, NLVR-based Manipulation, Natural Language Navigation) plus SPARTUN SQA. Across GPT-3.5-turbo and LLAMA-2, CoS consistently surpasses Chain-of-Thought prompting in accuracy while reducing input-token usage, with gains of up to 60.8 percentage points on Brick World (from 31.8% to 92.6%) and robust performance across languages and model sizes. The results suggest that symbolic representations can unlock emergent symbolic understanding in large models, enabling cheaper, more reliable spatial planning without additional training.

Abstract

In this paper, we take the initiative to investigate the performance of LLMs on complex planning tasks that require LLMs to understand a virtual spatial environment simulated via natural language and act accordingly in text. We propose a benchmark named Natural Language Planning and Action (Natala) composed of a set of novel tasks: Brick World, NLVR-based Manipulations, and Natural Language Navigation. We find that current popular LLMs such as ChatGPT still lack complex planning abilities. This raises a question -- do LLMs have a good understanding of environments described in natural language, or are alternatives such as symbolic representations neater and hence easier for LLMs to understand? To this end, we propose a novel method called CoS (Chain-of-Symbol Prompting) that represents complex environments with condensed symbolic spatial representations during the chained intermediate thinking steps. CoS is easy to use and does not need additional training of LLMs. Extensive experiments indicate that CoS clearly surpasses the performance of Chain-of-Thought (CoT) prompting in all three planning tasks, with even fewer input tokens than CoT, on ChatGPT and InstructGPT. The performance gain is strong: up to 60.8 percentage points in accuracy (from 31.8% to 92.6%) on Brick World for ChatGPT. CoS also substantially reduces the number of tokens in the prompt, by up to 65.8% (from 407 to 139 tokens) for the intermediate steps from demonstrations on Brick World. Code and data are available at: https://github.com/hanxuhu/chain-of-symbol-planning
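
As a rough illustration of the idea in the abstract, the sketch below condenses a Brick World style description into a short symbolic chain and reads off which bricks must be removed to reach a target. It also rechecks the reported token reduction. The "/" separator, the sentence pattern, and the function names (to_symbol_chain, removal_order, relative_reduction) are assumptions made for this example, not the authors' exact prompt format; the actual prompts are in the linked repository.

```python
import re


def to_symbol_chain(description):
    """Condense 'brick X is on top of brick Y' sentences into a top-to-bottom
    chain such as 'C/E/D/A/B'. The '/' notation is an illustrative assumption."""
    pairs = re.findall(
        r"brick (\w+) is on top of (?:the )?brick (\w+)", description
    )
    below = {top: bottom for top, bottom in pairs}
    covered = set(below.values())
    # The top-most brick is the only one that never appears underneath another.
    tops = [brick for brick in below if brick not in covered]
    if len(tops) != 1:
        raise ValueError("expected exactly one stack of bricks")
    chain = [tops[0]]
    while chain[-1] in below:
        chain.append(below[chain[-1]])
    return "/".join(chain)


def removal_order(chain, target):
    """Bricks to grab, from the top down, until the target is reached."""
    bricks = chain.split("/")
    return bricks[: bricks.index(target) + 1]


def relative_reduction(before, after):
    """Fraction of tokens saved when going from `before` to `after`."""
    return (before - after) / before


description = (
    "The brick C is on top of the brick E. "
    "The brick D is on top of the brick A. "
    "The brick E is on top of the brick D. "
    "The brick A is on top of the brick B."
)
chain = to_symbol_chain(description)
print(chain)                      # C/E/D/A/B
print(removal_order(chain, "D"))  # ['C', 'E', 'D']
# Token counts reported in the abstract: 407 (CoT) vs. 139 (CoS).
print(round(relative_reduction(407, 139) * 100, 1))  # 65.8
```

In CoS prompting the model itself produces such chains during few-shot inference; the parser here only mimics that condensed intermediate representation to show why it is shorter than the sentence-by-sentence reasoning of CoT.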

Paper Structure (34 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: An example comparing Chain-of-Thought (CoT) with Chain-of-Symbol (CoS) prompting; CoS lets large language models tackle complex planning tasks with higher performance and fewer input tokens. We let the model generate CoT/CoS during inference in a few-shot manner. Results were obtained in May 2023 with ChatGPT and are subject to change.
  • Figure 2: <input, Chain of Symbol, output> example triples for our three proposed tasks (Brick World, NLVR-based Manipulation, and Natural Language Navigation) and for SPARTUN SQA. Chains of Symbols are highlighted.
  • Figure 3: Accuracy of CoS on Brick World 1D (Shuffle Both) when using different symbols.
  • Figure 4: Scaling curves of CoS and CoT with Llama-2 on three tasks.