VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh, Han-Yuan Hsu, Yi-Ting Chen, Winston H. Hsu

TL;DR

VICtoR tackles reward learning for long-horizon robotic manipulation by modeling visual-instruction correlations with a hierarchical framework. It decomposes tasks into stages, motions, and motion progress, leveraging GPT-4 for task knowledge generation and CLIP-based detectors for object-state reasoning, while using a Motion Progress Evaluator with time, motion, and language-contrastive losses to provide dense, structured rewards. The reward signal is built from a potential function that increases as task progress accumulates, enabling policy optimization via shaping rewards. Across simulated and real-world tests, VICtoR outperforms prior VIC methods, particularly on harder tasks, and ablations confirm the value of hierarchical cues and contrastive objectives for robust long-horizon learning. The approach demonstrates practical applicability by learning from action-free videos and language instructions, with real-world data illustrating improved progress tracking and reward quality.
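The shaping idea mentioned above can be illustrated with a minimal sketch. It assumes the standard potential-based shaping form, r_t = gamma * Phi(o_{t+1}) - Phi(o_t); the summary does not state the exact formula VICtoR uses, so the function name `shaping_rewards`, the discount `gamma`, and the example potentials are all hypothetical.

```python
from typing import List, Sequence


def shaping_rewards(potentials: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Turn per-frame potentials into per-step shaping rewards.

    Assumed form (standard potential-based shaping, not confirmed by the
    source): r_t = gamma * phi(o_{t+1}) - phi(o_t).
    """
    return [gamma * nxt - cur for cur, nxt in zip(potentials, potentials[1:])]


# Hypothetical potentials for a four-frame rollout that makes steady progress.
example_potentials = [0.0, 1.0, 2.0, 3.0]
example_rewards = shaping_rewards(example_potentials, gamma=1.0)
```

With `gamma=1.0` the rewards telescope, so their sum equals the final potential minus the initial one. A potential that rises with task progress therefore yields positive reward only when the agent actually advances.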

Abstract

We study reward models for long-horizon manipulation tasks by learning from action-free videos and language instructions, which we term the visual-instruction correlation (VIC) problem. Recent advancements in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. VICtoR precisely assesses task progress at various levels through a novel stage detector and motion progress evaluator, offering insightful guidance for agents learning the task effectively. To validate the effectiveness of VICtoR, we conducted extensive experiments in both simulated and real-world environments. The results suggest that VICtoR outperformed the best existing VIC methods, achieving a 43% improvement in success rates for long-horizon tasks.

Paper Structure (62 sections, 9 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Problems in existing VIC methods and VICtoR's solution. Training on long-horizon tasks with existing VIC methods commonly suffers from the listed problems. To address them, we propose VICtoR, a hierarchical reward model that decomposes long-horizon tasks and assigns rewards by identifying the stage, motion, and progress of the agent from visual observations.
  • Figure 2: Training and inference pipeline of VICtoR. VICtoR is trained using motion-level videos with language annotations and object state labels. It first decomposes the task into task knowledge: the decomposed stages, conditional object states, and motions. Then, it uses a Stage Detector to identify the stage, and a Motion Progress Evaluator (VLM) to detect the motion and in-motion progress.
  • Figure 3: Environment information. Tasks are generated from permutations of actions on interactable objects shown in the figure.
  • Figure 4: Potential comparison across different tasks: We compare the potential generated by different reward models. In these comparisons, we can see that VICtoR provides the most progressive and near-strictly increasing potential function, especially as the horizon increases. This demonstrates its ability to provide fine-grained rewards for long-horizon tasks.
  • Figure 5: VICtoR on real-world data. This figure displays potential visualizations from LIV (Ma et al., 2023) and VICtoR for long-horizon tasks from XSkill, with correct and incorrect test videos.
  • ...and 7 more figures
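
The captions for Figures 1 and 4 describe a potential assembled from the detected stage, motion, and in-motion progress, which ideally increases near-strictly over a successful rollout. The sketch below shows one way such a potential could be composed and how its monotonicity could be measured. The weighting scheme and the names `hierarchical_potential` and `monotonic_fraction` are assumptions made for illustration, not the formulation used in the paper.

```python
from typing import Sequence


def hierarchical_potential(
    stage_idx: int,
    motion_idx: int,
    progress: float,
    motions_per_stage: Sequence[int],
) -> float:
    """Combine stage, motion, and in-motion progress into one scalar.

    Hypothetical scheme: each completed stage contributes 1.0, and the
    current stage contributes the fraction of its motions completed,
    with `progress` in [0, 1] covering the motion in flight.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp noisy estimates
    n_motions = motions_per_stage[stage_idx]
    return stage_idx + (motion_idx + progress) / n_motions


def monotonic_fraction(potentials: Sequence[float]) -> float:
    """Fraction of consecutive steps where the potential does not decrease."""
    steps = list(zip(potentials, potentials[1:]))
    if not steps:
        return 1.0
    return sum(nxt >= cur for cur, nxt in steps) / len(steps)
```

Under this scheme, a task with two motions in its first stage and one in its second has a potential that runs from 0.0 at the start to 2.0 at completion, and `monotonic_fraction` returns 1.0 for a rollout whose potential never dips.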