Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning

Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas

TL;DR

This paper tackles robot visuo-motor policy learning by learning action-aware, object-centric scene representations. It introduces an encoder that couples semantic segmentation and feature extraction via the SOLV slot-attention framework, producing a unified representation with both "what" and "where" components and a final dimensionality of $(D_{\text{what}} + D_{\text{where}}) \times 4 = (128 + 100) \times 4 = 912$. The model is pretrained on large out-of-domain video data and fine-tuned on human-action videos (Something-Something), then evaluated with imitation learning (Behavioral Cloning) and offline RL (Implicit Q-Learning) on the TOTO pouring task, where it outperforms baselines. Action-enhanced object-centric representations yield higher rewards and success rates, with the 4-slot configuration and the inclusion of spatial information providing the strongest gains. This highlights the value of transferring action-object knowledge from human videos to robotic tasks and reduces the need for annotated robot-specific data.
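The dimensionality arithmetic above can be made concrete with a small NumPy sketch. The sizes (4 slots, 128-dimensional "what" vectors, 100-dimensional "where" vectors) come from the summary; everything else is an illustrative assumption. In particular, the function name `build_scene_representation` is hypothetical, the slot features are random placeholders rather than encoder outputs, and treating each "where" vector as a flattened 10×10 attention mask is only one plausible way to arrive at 100 dimensions, not something this summary states.

```python
import numpy as np

NUM_SLOTS = 4   # slots kept by the Slot Merger (from the summary)
D_WHAT = 128    # per-slot "what" feature size (from the summary)
D_WHERE = 100   # per-slot "where" feature size (from the summary)


def build_scene_representation(what: np.ndarray, where: np.ndarray) -> np.ndarray:
    """Concatenate per-slot "what" and "where" vectors, then flatten.

    Hypothetical helper: it only illustrates the shape bookkeeping,
    not the authors' actual implementation.
    """
    if what.shape != (NUM_SLOTS, D_WHAT):
        raise ValueError(f"expected what of shape {(NUM_SLOTS, D_WHAT)}, got {what.shape}")
    if where.shape != (NUM_SLOTS, D_WHERE):
        raise ValueError(f"expected where of shape {(NUM_SLOTS, D_WHERE)}, got {where.shape}")
    per_slot = np.concatenate([what, where], axis=-1)  # shape (4, 228)
    return per_slot.reshape(-1)                        # shape (912,)


rng = np.random.default_rng(0)

# Placeholder "what" features standing in for the Slot Merger outputs.
what = rng.standard_normal((NUM_SLOTS, D_WHAT))

# Assumption: each "where" vector is a flattened 10x10 attention mask.
masks = rng.random((NUM_SLOTS, 10, 10))
where = masks.reshape(NUM_SLOTS, -1)

scene = build_scene_representation(what, where)
print(scene.shape)  # (912,)
```

Flattening slot by slot keeps each object's appearance and location features adjacent in the final vector, which is one straightforward layout for feeding a downstream policy network.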

Abstract

Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that these visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach to semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions, although still out-of-domain, can significantly improve performance due to their close alignment with robotic tasks. These findings show that reliance on annotated or robot-specific action datasets can be reduced, and that existing visual encoders can be built upon to accelerate training and improve generalizability.

Paper Structure

This paper contains 11 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Approach Overview: Our proposed encoder extracts "what" and "where" visual object-centric embeddings, which are combined to form scene representations. These embeddings distill action-based knowledge acquired from pretraining on human action videos to effectively guide robot policy learning.
  • Figure 2: Our proposed architecture: Image embeddings produced by DINOv2 are processed by the Slot Attention Module to generate slots. The Slot Merger then outputs four final object-specific features, referred to as "what" vectors, each corresponding to a distinct semantic region. These are combined with the associated "where" vectors, which encode spatial information derived from the attention masks, to form a unified object-centric scene representation.
  • Figure 3: Experimental Environment: We simulate the Franka Emika Panda robot arm pouring task within the TOTO framework. The goal is to transfer as many spheres as possible from the cup to the container.
  • Figure 4: Simulation frames from the TOTO pouring task, segmented by the Slot Attention masks of the SOLV model, with the Slot Merger module outputting 4, 6, and 8 slots respectively (from left to right).
  • Figure 5: Success rates and mean rewards for different slot-count configurations in the TOTO pouring simulation task. Each variant was trained five times and evaluated across 100 trajectories.
  • ...and 1 more figure