LangPert: Detecting and Handling Task-level Perturbations for Robust Object Rearrangement

Xu Yin, Min-Sung Yoon, Yuchi Huo, Kang Zhang, Sung-Eui Yoon

TL;DR

LangPert tackles Task-Level Perturbations in tabletop object rearrangement by combining a Visual Language Model (VLM) for global task monitoring, an LLM planner driven by Hierarchical Chain-of-Thought (HCoT) reasoning, and a language-conditioned low-level policy. The framework uses dual-view perception to detect both execution outcomes and perturbations, and applies HCoT reasoning to generate adaptive corrective plans, enabling re-planning in dynamic environments. Empirical results show LangPert achieves higher task completion rates and greater efficiency than baselines across ADD, RMV, and DIS perturbations (object additions, removals, and displacements), with potential generalization to unseen scenarios. The approach offers a practical route toward robust autonomous rearrangement in unstructured settings, with future work oriented toward real-world deployment and richer sensing modalities.

Abstract

Task execution for object rearrangement can be challenged by Task-Level Perturbations (TLP), i.e., unexpected object additions, removals, and displacements that can disrupt underlying visual policies and fundamentally compromise task feasibility and progress. To address these challenges, we present LangPert, a language-based framework designed to detect and mitigate TLP situations in tabletop rearrangement tasks. LangPert integrates a Visual Language Model (VLM) to comprehensively monitor the policy's skill execution and environmental TLP, while leveraging a Hierarchical Chain-of-Thought (HCoT) reasoning mechanism to enhance the Large Language Model (LLM)'s contextual understanding and generate adaptive, corrective skill-execution plans. Our experimental results demonstrate that LangPert handles diverse TLP situations more effectively than baseline methods, achieving higher task completion rates, improved execution efficiency, and potential generalization to unseen scenarios.

Paper Structure

This paper contains 12 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Environmental perturbation example in a box-packing task. As the robot places sneaker in brown box at step 0 ($k$=0), a new box unexpectedly appears (marked with the red dashed box). Standard VLM approaches [success_detector, doremi] detect only the robot’s execution outcome and lack awareness of broader environmental changes, leading the robot to place subsequent objects in the incorrect box. Handling such perturbations requires continuous global monitoring to detect perturbations and generate corrective strategies accordingly.
  • Figure 2: Illustration of how ADD perturbations affect affordance predictions. In the Normal Scenario (left), the original affordance predictions for placing pink block in brown box are shown, with pixel values indicating the probability of action success. In the Add A New Box and Add A New Pink Block scenarios (middle and right), the added objects (highlighted in red dashed boxes) alter these affordance distributions, potentially introducing ambiguity in execution.
  • Figure 3: Framework overview. LangPert comprises: (1) a global VLM monitor that provides real-time workspace observations to detect execution failures and perturbations; (2) an LLM-based planner, which utilizes Hierarchical CoT reasoning to generate corrective plans based on VLM feedback; and (3) a language-conditioned actor module [cliport], which executes low-level visual policies based on the skill instruction.
  • Figure 4: Camera configuration and VQA template for VLM. At each step $k$, given the skill instruction $\ell_{k}$, we capture RGB observations from both front and top-down views, forming two sequences of $N$ frames. The VLM is then queried to reason about the skill execution outcome and the perturbation.
  • Figure 5: Prompt structure with HCoT reasoning. Illustrated with the matching task [ground_decoding], our prompt consists of the following tags: "User" to provide the initial state description, "Robot" for fixed monologue instructions, and "VLM" for updates from the VLM. Upon detecting a perturbation $e_k$, the LLM planner is prompted to analyze $e_k$ through multiple CoT steps, following a layered structure (feasibility → progress → operation) to generate the corrective plan.
  • ...and 2 more figures
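
The prompt layout described in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `VLMFeedback`, `build_prompt`, and `HCOT_LAYERS`, along with the exact wording of each prompt line, are assumptions made for illustration. Only the three tags ("User", "Robot", "VLM") and the layered order (feasibility → progress → operation) come from the caption.

```python
from dataclasses import dataclass
from typing import Optional

# Layered reasoning order named in the Figure 5 caption.
HCOT_LAYERS = ("feasibility", "progress", "operation")


@dataclass
class VLMFeedback:
    """Hypothetical container for one monitoring result at step k."""
    step: int                           # step index k
    skill: str                          # skill instruction l_k
    success: bool                       # execution outcome reported by the VLM
    perturbation: Optional[str] = None  # e_k as text, or None if none detected


def build_prompt(initial_state: str, feedback: VLMFeedback) -> str:
    """Assemble a tagged prompt; the wording is illustrative, not the paper's."""
    outcome = "succeeded" if feedback.success else "failed"
    lines = [
        f"User: {initial_state}",
        f"VLM: step {feedback.step}, skill '{feedback.skill}' {outcome}.",
    ]
    if feedback.perturbation is None:
        lines.append("Robot: No perturbation reported.")
    else:
        lines.append(f"VLM: perturbation detected: {feedback.perturbation}")
        # One fixed monologue instruction per layer, in the stated order.
        for i, layer in enumerate(HCOT_LAYERS, start=1):
            lines.append(f"Robot: ({i}) Reason about {layer}.")
        lines.append("Robot: Output the corrective plan as a list of skills.")
    return "\n".join(lines)


if __name__ == "__main__":
    fb = VLMFeedback(
        step=0,
        skill="place sneaker in brown box",
        success=True,
        perturbation="a new box was added",
    )
    print(build_prompt("A sneaker and a brown box are on the table.", fb))
```

Keeping the layer order in a single tuple means the reasoning steps are always emitted in the same sequence, so a planner would consider whether the task is still feasible before reasoning about progress or concrete operations.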