Can only LLMs do Reasoning?: Potential of Small Language Models in Task Planning

Gawon Choi, Hyemin Ahn

TL;DR

The paper investigates whether small language models can serve as effective task planners in robotics by restricting the domain and distilling chain-of-thought reasoning into GPT2-sized models via a COST dataset generated by LLMs. It introduces COST, a pipeline to generate domain objects, high-level commands, and actionable steps, along with adaptable prompt templates for domain-specific dataset creation. Experimental results in tabletop and kitchen settings show finetuned GPT2-medium can approach GPT3.5 in planning quality, highlighting the practical viability of domain-focused small LMs under real-world constraints. The work also includes a tabletop simulator and user study to compare small LMs with LLMs, arguing for a pragmatic balance between model size, latency, and task complexity in robotic planning.

Abstract

In robotics, the use of Large Language Models (LLMs) is becoming prevalent, especially for understanding human commands. In particular, LLMs are utilized as domain-agnostic task planners for high-level human commands. LLMs are capable of Chain-of-Thought (CoT) reasoning, and this allows LLMs to be task planners. However, we need to consider that modern robots still struggle to perform complex actions, and the domains where robots can be deployed are limited in practice. This leads us to pose a question: If small LMs can be trained to reason in chains within a single domain, would even small LMs be good task planners for robots? To train smaller LMs to reason in chains, we build 'COmmand-STeps datasets' (COST), consisting of high-level commands along with corresponding actionable low-level steps, via LLMs. We release not only our datasets but also the prompt templates used to generate them, to allow anyone to build datasets for their domain. We compare GPT3.5 and GPT4 with finetuned GPT2 in tabletop and kitchen environments, and the results show that GPT2-medium is comparable to GPT3.5 for task planning in a specific domain. Our dataset, code, and more output samples can be found at https://github.com/Gawon-Choi/small-LMs-Task-Planning
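
To make the idea of a command-steps record concrete, here is a minimal Python sketch of how one such pair might be serialized into text for finetuning a GPT2-style causal LM. The class name `CostExample`, the helper names, the "Command:/Steps:" layout, and the example command are all assumptions made for illustration and are not taken from the paper or its repository; only the `<|endoftext|>` marker is GPT-2's standard end-of-text token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CostExample:
    """One hypothetical COST record: a high-level command and its low-level steps."""
    command: str
    steps: List[str]


def to_inference_prompt(command: str) -> str:
    """Prefix handed to the finetuned model, which then generates the steps."""
    return f"Command: {command}\nSteps:\n"


def to_training_text(example: CostExample, eos: str = "<|endoftext|>") -> str:
    """Serialize a record into one string for causal-LM finetuning.

    The layout is an assumed format, not the authors' actual one.
    """
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(example.steps, start=1)
    )
    return to_inference_prompt(example.command) + numbered + "\n" + eos


# Made-up tabletop example, for illustration only.
example = CostExample(
    command="Put the apple in the bowl",
    steps=["pick up the apple", "place the apple in the bowl"],
)
text = to_training_text(example)
print(text)
```

Building the training text on top of the inference prefix keeps the two formats in sync, so the model sees at inference time exactly the prefix it was trained to continue.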

Paper Structure (17 sections, 1 equation, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Overview of our proposed method. When the user specifies their domain, our method first extracts from LLMs the dataset for task planning, which consists of high-level commands and low-level actionable steps. The dataset is then used to train small LMs, which will be deployed in the robot.
  • Figure 2: A flow chart describing how our COmmand-STeps dataset (COST) is generated. A detailed description can be found in the paper's section on the proposed approach.
  • Figure 3: The prompt template for generating high-level commands from the object list. We first describe the task that the LLM should perform, then present the generation conditions, output examples, and an output template, and finally give an object list as input. The green lines mark where the user needs to fill in.
  • Figure 4: The prompt template for generating action steps for each high-level command. The blue lines appear only in the prompt where the available objects are fixed, and the black lines are common to both the tabletop and kitchen step prompts. The green lines mark where the user needs to fill in.
  • Figure 5: The prompt template for generating action steps for each high-level command. The blue lines appear only in the prompt where the available objects are not fixed, and the black lines are common to both the tabletop and kitchen step prompts. The green lines mark where the user needs to fill in.
  • ...and 5 more figures
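
The prompt layout described in the Figure 3 caption (task description, generation conditions, output examples, output template, then the object list) can be sketched as a small Python helper. This is only an illustrative assumption: the function name `build_command_prompt`, the section headings, and all example strings are hypothetical and do not reproduce the authors' released templates.

```python
from typing import List


def build_command_prompt(
    task: str,
    conditions: List[str],
    examples: List[str],
    output_template: str,
    objects: List[str],
) -> str:
    """Assemble a command-generation prompt in the order the caption describes.

    The user-supplied arguments play the role of the fill-in (green) lines.
    """
    sections = [
        task,
        "Conditions:\n" + "\n".join(f"- {c}" for c in conditions),
        "Output examples:\n" + "\n".join(f"- {e}" for e in examples),
        "Output template:\n" + output_template,
        "Objects: " + ", ".join(objects),
    ]
    return "\n\n".join(sections)


# Made-up tabletop domain, for illustration only.
prompt = build_command_prompt(
    task="Generate high-level commands a user might give a tabletop robot.",
    conditions=["Use only the listed objects.", "One command per line."],
    examples=["Put the apple in the bowl"],
    output_template="<command>",
    objects=["apple", "bowl", "cup"],
)
print(prompt)
```

Keeping the domain-specific parts as arguments mirrors the caption's point that only the fill-in lines change when the template is adapted to a new domain.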