RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation

Zixuan Chen, Jing Huo, Yangtao Chen, Yang Gao

TL;DR

RoboHorizon tackles long-horizon robotic manipulation by combining an LLM-assisted reward generation flow with a key-horizon, multi-view perception module (KMV-MAE) inside a single world model. Its Recognize-Sense-Plan-Act (RSPA) pipeline provides dense, stage-aware rewards, perception across multiple camera viewpoints, and planning through a recurrent dynamics-based world model that supports DreamerV2-style control. On RLBench and FurnitureBench, RoboHorizon outperforms state-of-the-art visual model-based RL baselines, improving task success rates by 23.35% on 4 short-horizon RLBench tasks and by 29.23% on 6 long-horizon RLBench tasks plus 3 FurnitureBench furniture assembly tasks. These results point to the value of staged rewards and key-horizon perception for task recognition, perception, and planning in sparse-reward, long-horizon environments.
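
To make "dense, stage-aware rewards" concrete, here is a minimal sketch of how a multi-stage reward could be structured. It is not the reward code from the paper: the `Stage` class, the `staged_reward` function, the state keys (`lid_open`, `shoe_grasped`, `shoe_on_table`), and the 0.99 completion threshold are all hypothetical, chosen only to illustrate the idea that each completed sub-task adds a fixed bonus and the active sub-task adds partial progress.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]  # hypothetical flat state: feature name -> value


@dataclass
class Stage:
    """One sub-task of a long-horizon task (illustrative, not from the paper)."""
    name: str
    progress: Callable[[State], float]  # expected to return a value in [0, 1]


def staged_reward(stages: List[Stage], state: State,
                  threshold: float = 0.99) -> Tuple[float, int]:
    """Return (reward, index of the first unfinished stage).

    Each finished stage contributes 1.0; the first unfinished stage
    contributes its clipped progress. Later stages contribute nothing,
    so the reward stays dense without rewarding out-of-order actions.
    """
    reward = 0.0
    for i, stage in enumerate(stages):
        p = min(max(stage.progress(state), 0.0), 1.0)
        if p < threshold:
            return reward + p, i
        reward += 1.0
    return reward, len(stages)


# Hypothetical stage breakdown for a "take shoes out of box"-style task.
stages = [
    Stage("open_box", lambda s: s["lid_open"]),
    Stage("grasp_shoe", lambda s: s["shoe_grasped"]),
    Stage("place_shoe", lambda s: s["shoe_on_table"]),
]

state = {"lid_open": 1.0, "shoe_grasped": 0.5, "shoe_on_table": 0.0}
reward, active = staged_reward(stages, state)
print(reward, stages[active].name)  # 1.5 grasp_shoe
```

In this sketch the stage list plays the role of the LLM output: the language model would propose the sub-tasks and their progress functions from the task instruction, and the environment would evaluate them at every step.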

Abstract

Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.
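
The abstract does not say how keyframes are discovered. A heuristic often used with RLBench demonstrations marks a step as a keyframe when the gripper state changes or the arm comes to rest, and the sketch below assumes that heuristic purely for illustration. The function names, the `vel_eps` threshold, and the input layout are hypothetical. The segment between two consecutive keyframes is treated here as a key-horizon, following the Figure 3 caption.

```python
from typing import List, Sequence, Tuple


def discover_keyframes(joint_velocities: Sequence[Sequence[float]],
                       gripper_open: Sequence[bool],
                       vel_eps: float = 1e-2) -> List[int]:
    """Return indices of keyframes in one demonstration (assumed heuristic).

    A step t >= 1 is a keyframe if the gripper state differs from step t-1,
    if the arm has just come to rest (all joint speeds below vel_eps at t
    but not at t-1), or if t is the final step of the demonstration.
    """
    if len(joint_velocities) != len(gripper_open):
        raise ValueError("inputs must have one entry per time step")

    def stopped(t: int) -> bool:
        return all(abs(v) < vel_eps for v in joint_velocities[t])

    keyframes: List[int] = []
    last = len(gripper_open) - 1
    for t in range(1, last + 1):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        came_to_rest = stopped(t) and not stopped(t - 1)
        if gripper_changed or came_to_rest or t == last:
            keyframes.append(t)
    return keyframes


def key_horizons(keyframes: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair consecutive keyframes into (start, end) segments."""
    return list(zip(keyframes[:-1], keyframes[1:]))


# Toy single-joint demonstration: move, stop, close gripper, move, stop.
velocities = [[0.0], [0.5], [0.0], [0.0], [0.4], [0.0]]
gripper = [True, True, True, False, False, False]

frames = discover_keyframes(velocities, gripper)
print(frames)                # [2, 3, 5]
print(key_horizons(frames))  # [(2, 3), (3, 5)]
```

A perception module could then sample or mask frames per key-horizon rather than uniformly over the episode, which is one plausible reading of how keyframe discovery feeds the multi-view MAE.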

Paper Structure

This paper contains 43 sections, 8 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: The proposed RSPA pipeline for long-horizon robotic manipulation.
  • Figure 2: RoboHorizon overview, using the long-horizon robotic manipulation task "take shoes out of box" in RLBench as the illustration example, following the proposed RSPA pipeline.
  • Figure 3: RGB observations of the keyframes found by the keyframe discovery method for the "take shoes out of box" task, shown from four camera viewpoints, together with the key-horizon between the last two keyframes from the front viewpoint.
  • Figure 4: Visualization of multi-view demonstrations from front, left, right, and wrist cameras for 10 RLBench tasks, and from front and wrist cameras for 3 FurnitureBench tasks.
  • Figure 5: SPA-driven baselines with LLM-generated dense rewards vs. RoboHorizon.
  • ...and 5 more figures