Look Before You Leap: Using Serialized State Machine for Language Conditioned Robotic Manipulation

Tong Mu, Yihao Liu, Mehran Armand

TL;DR

This work tackles the brittleness of language-conditioned imitation learning for long-horizon robotic manipulation by introducing a State Machine Serialization Language (SMSL) that guides demonstration generation through environment-aware state transitions. By leveraging LLMs to synthesize state, operation, and transition definitions, and by enforcing deterministic, constraint-consistent environment initializations, the approach achieves high demonstration coverage and robust long-horizon policy learning. Across three complex puzzles, the method substantially improves success rates over random-placement baselines, attaining up to 98% success with 1000 demonstrations per operation, highlighting the value of explicit state-aware data generation for scalable, language-conditioned robotics. The River Crossing formalism and the accompanying SMSL-based pipeline illustrate how finite-state reasoning can be integrated with language-conditioned policies to mitigate cascading errors in dynamic environments, with practical implications for robust, real-world manipulation tasks.

Abstract

Imitation learning frameworks for robotic manipulation have drawn attention in the recent development of language-model-grounded robotics. However, the success of these frameworks depends largely on the coverage of the demonstration cases: when the demonstration set does not include examples of how to act in all possible situations, an action may fail and cause cascading errors. To solve this problem, we propose a framework that uses a serialized Finite State Machine (FSM) to generate demonstrations and improve the success rate in manipulation tasks requiring a long sequence of precise interactions. To validate its effectiveness, we use long-horizon puzzles in which the environment evolves as the task progresses. Experimental results show that our approach achieves a success rate of up to 98% in these tasks, whereas the control condition using existing approaches reached a success rate of at most 60% and, in some tasks, failed almost completely.
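To make the coverage argument concrete, the following minimal Python sketch enumerates every reachable state and transition of a small finite state machine, so that each state could seed its own demonstration scene instead of a random placement. It is an illustration only: it assumes the classic farmer, wolf, goat, and cabbage variant of the River Crossing puzzle, it does not reproduce the SMSL syntax, and the names `is_safe`, `successors`, and `enumerate_fsm` are hypothetical helpers rather than part of the authors' pipeline.

```python
from collections import deque

# Assumption: classic River Crossing variant. A state is a tuple of bank
# indices (0 = start bank, 1 = far bank) for each entity, in this order.
ITEMS = ("farmer", "wolf", "goat", "cabbage")


def is_safe(state):
    """Return True if no constraint is violated in this state."""
    farmer, wolf, goat, cabbage = state
    if wolf == goat != farmer:      # wolf left alone with goat
        return False
    if goat == cabbage != farmer:   # goat left alone with cabbage
        return False
    return True


def successors(state):
    """Yield (operation, next_state) pairs for every legal crossing."""
    farmer = state[0]
    for i in range(len(ITEMS)):     # i == 0 means the farmer crosses alone
        if state[i] != farmer:
            continue                # item is on the other bank
        nxt = list(state)
        nxt[0] = 1 - farmer
        if i != 0:
            nxt[i] = 1 - farmer
        nxt = tuple(nxt)
        if is_safe(nxt):
            label = "cross alone" if i == 0 else f"carry {ITEMS[i]}"
            yield label, nxt


def enumerate_fsm(start=(0, 0, 0, 0)):
    """Breadth-first enumeration of all reachable states and transitions."""
    transitions = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in transitions:
            continue
        transitions[state] = list(successors(state))
        for _, nxt in transitions[state]:
            if nxt not in transitions:
                queue.append(nxt)
    return transitions


if __name__ == "__main__":
    fsm = enumerate_fsm()
    for state, moves in fsm.items():
        print(state, "->", moves)
```

Every key of the returned dictionary is a scene the robot can actually encounter while solving the puzzle, which is the kind of state a state-aware data collection step would initialize from.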

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison between a randomly initialized scene using the existing approach for data collection (a) and the actual scene the system encounters when completing the Towers of Hanoi task (b).
  • Figure 2: Overall architecture of the proposed method. The upstream stage processes the high-level task goal and passes detailed task descriptions with constraints to the midstream stage. The midstream stage offers two generation methods: direct generation splits the SMSL generation into multiple state branches that are generated separately and then integrated, while indirect generation uses an LLM to write a symbolic planning script that plans and formats the SMSL text. In the downstream stage, LLM agents with engineered prompts take this SMSL text, together with the task description and constraints from upstream, and generate the task demonstration code. The task code is then used to collect the imitation learning dataset in a "state-aware" workflow.
  • Figure 3: Comparison of demonstration generation workflows. The existing bottom-up approach lets an LLM explore potential new tasks based on the previous task list. The existing top-down approach takes a user-defined task name and generates code for that specific task. Both methods reset the environment and spawn objects randomly on the tabletop when generating a demonstration, which makes them unsuitable for long-horizon tasks or tasks with dynamically evolving spatial constraints.
  • Figure 4: Task initial scene