Self-CriTeach: LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning via Automated Domain Generation

Jinbang Huang, Zhiyuan Li, Yuanzhao Hu, Zhanguang Zhang, Mark Coates, Xingyue Quan, Yingxue Zhang

TL;DR

Self-CriTeach presents a unified self-improvement loop for LLM-based robotic planning by having the model generate PDDL planning domains that serve as scalable supervision data and structured rewards. It automatically refines these domains, transforms symbolic plans into chain-of-thought traces for supervised fine-tuning, and then uses the same domains as reward signals for reinforcement learning, enabling the model to internalize symbolic planning. The approach yields higher planning success, better cross-task generalization, and lower token costs, with demonstrated robustness to imperfect logical states in real-robot experiments. This framework provides a practical pathway to integrate symbolic planning with learning in robotics, reducing manual annotation and reward engineering while improving long-horizon decision making.

Abstract

Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through automatic planning domain generation. However, planning domains are brittle under imperfect logical states and perception noise, and prior approaches largely treat generated planning domains as planning utilities, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges in reward engineering. We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (i) enabling large-scale generation of robotic planning problem-plan pairs, and (ii) providing structured reward functions. First, the self-written domains enable large-scale generation of symbolic task plans, which are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the self-written domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and improved robustness to imperfect logical states.
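
To make the idea of a planning domain acting as a structured reward more concrete, the following is a minimal sketch, not the paper's actual reward. It assumes ground STRIPS-style actions in place of full PDDL operators, and the names `Action`, `plan_reward`, and `goal_bonus`, as well as the scoring rule (fraction of executable steps plus a bonus for reaching the goal), are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical stand-in for a PDDL operator)."""
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def plan_reward(actions, init, goal, plan, goal_bonus=1.0):
    """Score a plan by simulating it against the domain.

    Assumed scoring rule (not taken from the paper): the fraction of steps
    that execute before the first invalid one, plus `goal_bonus` if every
    step is valid and the goal holds in the final state.
    """
    state = set(init)
    valid = 0
    for name in plan:
        action = actions.get(name)
        # Stop at the first unknown action or unsatisfied precondition.
        if action is None or not action.preconditions <= state:
            break
        state = (state - action.del_effects) | action.add_effects
        valid += 1
    progress = valid / len(plan) if plan else 0.0
    reached = valid == len(plan) and set(goal) <= state
    return progress + (goal_bonus if reached else 0.0)


# Toy two-block domain, written by hand for illustration only.
ACTIONS = {
    "pickup_a": Action(
        preconditions=frozenset({"clear_a", "ontable_a", "handempty"}),
        add_effects=frozenset({"holding_a"}),
        del_effects=frozenset({"clear_a", "ontable_a", "handempty"}),
    ),
    "stack_a_b": Action(
        preconditions=frozenset({"holding_a", "clear_b"}),
        add_effects=frozenset({"on_a_b", "clear_a", "handempty"}),
        del_effects=frozenset({"holding_a", "clear_b"}),
    ),
}
INIT = {"clear_a", "ontable_a", "clear_b", "ontable_b", "handempty"}
GOAL = {"on_a_b"}

print(plan_reward(ACTIONS, INIT, GOAL, ["pickup_a", "stack_a_b"]))  # 2.0
print(plan_reward(ACTIONS, INIT, GOAL, ["stack_a_b", "pickup_a"]))  # 0.0
```

Because the score grows with the number of valid steps rather than only rewarding full success, a partially correct plan still receives a graded signal, which is one simple way to read "dense feedback" in the abstract.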

Paper Structure

This paper contains 50 sections, 20 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the proposed Self-CriTeach framework. The base LLM first generates and iteratively refines PDDL planning domains, which are used to perform symbolic search and produce task plans with intermediate states. The same base LLM then converts these plans into chain-of-thought traces that include plan explanation, state-transition checking, alternative exploration, and failure backtracking. The resulting CoT data is first used for supervised fine-tuning, after which the same self-written planning domains provide structured reward signals for reinforcement learning. Together, supervised and reinforcement learning enable the model to internalize symbolic planning behavior, yielding a reasoning-enhanced LLM with improved generalization and long-horizon reasoning.
  • Figure 2: Overall success rate versus average per-plan token cost across top-performing baseline approaches.
  • Figure 3: Real Robot Planning with SCT-4B: Reorganize Room (lower, 13 steps); Prepare Experiment (upper, 8 steps).
  • Figure 4: Evaluation data distribution for Blocks World Classic
  • Figure 5: Evaluation data distribution for Blocks World Classic
  • ...and 1 more figure
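
The Figure 1 caption mentions task plans with intermediate states and state-transition checking. The sketch below shows, under stated assumptions, how a symbolic plan could be replayed to expose its intermediate states as text. In the paper the base LLM writes the chain-of-thought traces; the fixed sentence templates here, the function name `render_trace`, and the toy two-block domain are hypothetical and serve only to illustrate the state-transition bookkeeping.

```python
def render_trace(actions, init, plan):
    """Replay `plan` and describe each state transition in plain text.

    `actions` maps an action name to (preconditions, add_effects,
    del_effects), each a frozenset of ground facts. This template-based
    rendering is an illustrative assumption, not the paper's method.
    """
    state = frozenset(init)
    lines = []
    for i, name in enumerate(plan, start=1):
        pre, add, delete = actions[name]
        missing = sorted(pre - state)
        if missing:
            # A failed precondition check is where a trace would backtrack.
            lines.append(
                f"Step {i}: {name} is not applicable "
                f"(missing: {', '.join(missing)}); backtrack."
            )
            break
        state = (state - delete) | add
        facts = ", ".join(sorted(state))
        lines.append(f"Step {i}: apply {name}; state is now {{{facts}}}.")
    return lines


# Toy two-block domain, written by hand for illustration only.
TOY_ACTIONS = {
    "pickup_a": (
        frozenset({"clear_a", "ontable_a", "handempty"}),
        frozenset({"holding_a"}),
        frozenset({"clear_a", "ontable_a", "handempty"}),
    ),
    "stack_a_b": (
        frozenset({"holding_a", "clear_b"}),
        frozenset({"on_a_b", "clear_a", "handempty"}),
        frozenset({"holding_a", "clear_b"}),
    ),
}
TOY_INIT = {"clear_a", "ontable_a", "clear_b", "ontable_b", "handempty"}

for line in render_trace(TOY_ACTIONS, TOY_INIT, ["pickup_a", "stack_a_b"]):
    print(line)
```

Keeping the replay deterministic means every sentence in such a trace can be checked against the domain, which is the property that state-transition checking relies on.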