A Backbone for Long-Horizon Robot Task Understanding
Xiaoshuai Chen, Wei Chen, Dongmyoung Lee, Yukun Ge, Nicolas Rojas, Petar Kormushev
TL;DR
End-to-end learning for long-horizon robot tasks suffers from poor generalization and data inefficiency. This work introduces the Therblig-Based Backbone Framework (TBBF), a structured backbone that decomposes tasks into therbligs and integrates an offline segmentation network (MGSF), an action-registration module (ActionREG), and an LLM-alignment policy for visual correction (LAP-VC) to enable one-shot transfer to new scenarios. Empirical results show 94.37% recall in therblig segmentation and online task success rates of 94.4% in simple scenarios and 80% in complex scenarios, with LAP-VC achieving strong alignment. The framework improves interpretability, data efficiency, and generalization, enabling more reliable long-horizon robot manipulation in cluttered and dynamic environments. The authors also outline future work on larger datasets, 3D configurations, and deploying a local LLM to reduce latency.
Abstract
End-to-end robot learning, particularly for long-horizon tasks, often results in unpredictable outcomes and poor generalization. To address these challenges, we propose a novel Therblig-Based Backbone Framework (TBBF) as a fundamental structure to enhance interpretability, data efficiency, and generalization in robotic systems. TBBF utilizes expert demonstrations to enable therblig-level task decomposition, facilitate efficient action-object mapping, and generate adaptive trajectories for new scenarios. The approach consists of two stages: offline training and online testing. In the offline training stage, we develop the Meta-RGate SynerFusion (MGSF) network for accurate therblig segmentation across various tasks. In the online testing stage, after a one-shot demonstration of a new task is collected, our MGSF network extracts high-level knowledge, which is then encoded into the image using Action Registration (ActionREG). Additionally, a Large Language Model (LLM)-Alignment Policy for Visual Correction (LAP-VC) is employed to ensure precise action registration, facilitating trajectory transfer in novel robot scenarios. Experimental results validate these methods, achieving 94.37% recall in therblig segmentation and success rates of 94.4% and 80% in real-world online robot testing for simple and complex scenarios, respectively. Supplementary material is available at: https://sites.google.com/view/therbligsbasedbackbone/home
