LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

Yue Yang; Shuo Cheng; Yu Fang; Homanga Bharadhwaj; Mingyu Ding; Gedas Bertasius; Daniel Szafir

TL;DR

LiLo-VLA (Linked Local VLA) is a modular framework that generalizes zero-shot to novel long-horizon tasks without ever being trained on them. Its modularity enables robust failure recovery through dynamic replanning and skill reuse, mitigating the cascading errors common in end-to-end approaches.

Abstract

General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
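The reach-then-interact chaining with failure recovery described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Skill`, `run_task`, and `max_retries` are hypothetical, each module is reduced to a callable that reports success, and recovery is approximated by re-running the reaching step before retrying the same skill.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Skill:
    """One atomic skill (hypothetical structure, for illustration only)."""
    name: str
    reach: Callable[[], bool]     # stand-in for the Reaching Module (global transport)
    interact: Callable[[], bool]  # stand-in for the Interaction Module (object-centric policy)


def run_task(skills: List[Skill], max_retries: int = 2) -> Tuple[bool, List[Tuple[str, int, bool]]]:
    """Execute skills in order; on failure, fall back to reaching and retry.

    Returns (task_success, log), where each log entry is
    (skill_name, attempt_number, attempt_succeeded).
    """
    log: List[Tuple[str, int, bool]] = []
    for skill in skills:
        succeeded = False
        # One initial attempt plus up to max_retries recovery attempts.
        for attempt in range(1, max_retries + 2):
            # Every attempt starts with reaching, which acts as the state reset.
            reached = skill.reach()
            succeeded = reached and skill.interact()
            log.append((skill.name, attempt, succeeded))
            if succeeded:
                break
        if not succeeded:
            # Retry budget exhausted: the whole task fails at this skill.
            return False, log
    return True, log
```

Because each skill is retried locally, a failed interaction does not invalidate skills that already succeeded, which is the property the abstract credits with limiting cascading errors. How the actual system detects failure and replans is not specified in this summary.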

Paper Structure (46 sections, 3 equations, 10 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Vision-Language-Action Models
Long-Horizon Manipulation and Skill Chaining
Methodology
Overview of LiLo-VLA
The Reaching Module: Global Transport
Relative Goal Generation with Perturbation
Collision-Free Motion Planning
The Interaction Module: Object-Centric VLA
Object-Centric Observation Space
Visual Clutter Augmentation
Compositional Execution and Failure Recovery
Sequential Execution Pipeline
Closed-Loop Recovery Mechanism
...and 31 more sections

Figures (10)

Figure 1: LiLo-VLA enables composable and robust manipulation. LiLo-VLA solves long-horizon tasks by sequentially executing object-centric skill policies connected by robust motion planning. This enables zero-shot compositional generalization and robustness against cascading failures.
Figure 2: Architecture of LiLo-VLA. Our framework decouples manipulation into two distinct phases. (Top Left) The Reaching Module handles global transport via collision-free motion planning. It employs initial state perturbation during training to make the policy robust to pose errors during deployment. (Top Right) The Interaction Module executes atomic skills via an object-centric VLA, strictly utilizing wrist-view observations and visual masking to eliminate environmental distractors. (Bottom) The system sequentially chains these modules, enabling closed-loop failure recovery in which a skill's execution errors trigger a fallback to the Reaching Module for state resetting.
Figure 3: Overview of Evaluation Benchmarks. We introduce two suites to evaluate long-horizon manipulation: Suite 1 (LIBERO-Long++) focuses on visual robustness by introducing more complex backgrounds with multiple distractors (highlighted in red), while Suite 2 (Ultra-Long) tests temporal scalability with task sequences extending up to 16 steps. Both suites incorporate multiple variant configurations with permuted skill orders to rigorously assess zero-shot compositional generalization.
Figure 4: Impact of State Perturbation. Average success rates across 27 unique skills demonstrate that our interaction policy remains robust to initial pose noise due to state perturbation, whereas the unperturbed policy degrades significantly.
Figure 5: Comparison across different camera configurations. Our wrist-only design achieves the highest success rate and the smallest performance drop under OSS.
...and 5 more figures
