UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning

Hongyu Chen, Guangrun Wang

TL;DR

This work introduces UML-CoT, a structured chain-of-thought framework that uses UML class diagrams for symbolic reasoning and UML activity diagrams for executable plans in robotic room cleaning. By replacing unstructured text with expressive UML representations, the approach addresses interpretability, verification, and planning reliability, unifying reasoning and action under a formalism that supports inheritance, aggregation, and procedural control. A three-stage training pipeline (SFT, RLFT with GRPO, and answer-only GRPO) together with the MRoom-30k dataset demonstrates improved plan coherence, structural fidelity, and execution success over text-based and graph-based baselines. The results highlight the practical impact of structured symbolic representations for embodied AI, enabling more transparent and robust coordination between perception, reasoning, and manipulation in cluttered indoor environments.

Abstract

Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.
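To make the two diagram roles concrete, here is a minimal sketch that emits PlantUML text for a class diagram (inheritance and aggregation) and an activity diagram (sequential and conditional steps). The helper names, the object classes (Item, Cup, Table), and the cleaning steps are hypothetical illustrations, not taken from the paper or from MRoom-30k; only the PlantUML syntax itself is standard.

```python
def class_diagram(classes, inheritance, aggregation):
    """Render a PlantUML class diagram as text (hypothetical helper)."""
    lines = ["@startuml"]
    for name, attrs in classes.items():
        lines.append(f"class {name} {{")
        lines.extend(f"  {attr}" for attr in attrs)
        lines.append("}")
    for parent, child in inheritance:
        lines.append(f"{parent} <|-- {child}")   # child inherits from parent
    for whole, part in aggregation:
        lines.append(f"{whole} o-- {part}")      # whole aggregates part
    lines.append("@enduml")
    return "\n".join(lines)


def activity_diagram(steps):
    """Render a PlantUML activity diagram as text (hypothetical helper).

    A step is either an action string or a tuple
    (condition, then_actions, else_actions).
    """
    lines = ["@startuml", "start"]
    for step in steps:
        if isinstance(step, str):
            lines.append(f":{step};")
        else:
            cond, then_actions, else_actions = step
            lines.append(f"if ({cond}) then (yes)")
            lines.extend(f"  :{action};" for action in then_actions)
            lines.append("else (no)")
            lines.extend(f"  :{action};" for action in else_actions)
            lines.append("endif")
    lines += ["stop", "@enduml"]
    return "\n".join(lines)


# Illustrative scene: all names below are assumptions made for this example.
reasoning = class_diagram(
    classes={"Item": ["location"], "Cup": ["is_dirty"], "Table": []},
    inheritance=[("Item", "Cup")],
    aggregation=[("Table", "Cup")],
)
plan = activity_diagram([
    "Pick up cup",
    ("cup is dirty?", ["Place cup in sink"], ["Place cup on shelf"]),
])

print(reasoning)
print(plan)
```

Keeping the reasoning and the plan as two separate diagrams mirrors the split described above: the class diagram states what is in the scene and how objects relate, and the activity diagram states what to do with them.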

Paper Structure

This paper contains 50 sections, 8 equations, 15 figures, 4 tables, 1 algorithm.

Figures (15)

  • Figure 1: Structured vs. Unstructured Chain-of-Thought Reasoning for Robotic Room Cleaning. (a) Input: a cluttered room image and a text instruction. (b) Output from a plain-text CoT model, where reasoning and planning are expressed only in free-form language, lacking formal semantics and executable structure. (c) Output from our proposed UML-based framework, where reasoning is encoded as a UML class diagram and the corresponding plan is formalized as a UML activity diagram. This structured approach improves interpretability, ensures alignment between reasoning and action, and supports modular, executable planning. Please zoom in to view details clearly.
  • Figure 2: UML class diagram and its corresponding PlantUML source shown in two equivalent forms: (a) UML class diagram; (b) PlantUML source code. Please zoom in to view details clearly.
  • Figure 3: UML activity diagram and its corresponding PlantUML source presented in two equivalent forms: (a) UML activity diagram; (b) corresponding PlantUML code. Please zoom in to view details clearly.
  • Figure 4: Model architecture. The image is preprocessed via dynamic resolution slicing and fed into a ViT-based encoder (InternViT-300M), followed by pixel unshuffling and MLP projection. The language decoder (InternLM2.5) receives both visual features and tokenized textual prompts, and generates two symbolic outputs: a UML class diagram for structured reasoning, and a UML activity diagram for executable cleaning plans. Please zoom in to view details clearly.
  • Figure 5: Overview of Group Relative Policy Optimization (GRPO), applicable to both Stage 2 and Stage 3. The model receives image, CoT, and plan as input, generates multiple candidates, evaluates them using advantage scores, and updates its parameters based on the best-scoring candidate.
  • ...and 10 more figures
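
The candidate scoring step mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration of the commonly described group-relative advantage in GRPO, where each candidate's reward is normalized against the mean and standard deviation of its own group. The reward values are made up, and the function name, the use of the population standard deviation, and the `eps` term are assumptions for this example rather than details confirmed by the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate outputs sampled for one input.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Index of the best-scoring candidate in the group.
best = max(range(len(rewards)), key=advantages.__getitem__)

print(advantages, best)
```

Because the advantages are centered on the group mean, candidates that score above the group average receive positive values and those below receive negative values, so no separately learned value baseline is needed.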