Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning
Nikos Giannakakis, Argyris Manetas, Panagiotis P. Filntisis, Petros Maragos, George Retsinas
TL;DR
This paper tackles robot visuo-motor policy learning by learning action-aware, object-centric scene representations. It introduces an encoder that couples semantic segmentation and feature extraction via the SOLV slot-attention framework, producing a unified representation with both 'what' and 'where' components and a final dimensionality of $(D_{what}+D_{where}) \times 4 = (128+100) \times 4 = 912$. The model is pretrained on large out-of-domain video data and fine-tuned on human-action videos (Something-Something), then evaluated on imitation learning (Behavioral Cloning) and offline RL (Implicit Q-Learning) in the TOTO pouring scenario, where it outperforms baselines. Action-enhanced object-centric representations yield higher rewards and success rates, with 4 slots and the inclusion of spatial information providing the strongest gains. These results highlight the value of transferring action-object knowledge from human videos to robotic tasks and of reducing annotation needs for robot-specific data.
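The dimensionality quoted above can be checked with a small sketch. It assumes that each of the 4 slots contributes a 128-dimensional 'what' vector and a 100-dimensional 'where' vector, and that these are concatenated per slot and then flattened across slots. The function name `assemble_representation`, the concatenation order, and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

NUM_SLOTS = 4   # number of slots, as stated in the text
D_WHAT = 128    # per-slot 'what' feature size, as stated in the text
D_WHERE = 100   # per-slot 'where' feature size, as stated in the text


def assemble_representation(what: np.ndarray, where: np.ndarray) -> np.ndarray:
    """Concatenate per-slot 'what' and 'where' features, then flatten over slots.

    Hypothetical helper: the ordering of the components is an assumption.
    """
    if what.shape != (NUM_SLOTS, D_WHAT):
        raise ValueError(f"expected 'what' of shape {(NUM_SLOTS, D_WHAT)}, got {what.shape}")
    if where.shape != (NUM_SLOTS, D_WHERE):
        raise ValueError(f"expected 'where' of shape {(NUM_SLOTS, D_WHERE)}, got {where.shape}")
    per_slot = np.concatenate([what, where], axis=-1)  # shape (4, 228)
    return per_slot.reshape(-1)                        # shape (912,)


# Placeholder inputs standing in for encoder outputs (random, for shape checking only).
rng = np.random.default_rng(0)
what = rng.standard_normal((NUM_SLOTS, D_WHAT))
where = rng.standard_normal((NUM_SLOTS, D_WHERE))

z = assemble_representation(what, where)
print(z.shape)  # (912,)
```

A flat vector of this kind is convenient as input to a downstream policy network, although the policy architecture is not specified in this section.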
Abstract
Learning visual representations from observing actions to benefit robot visuo-motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object-based fashion, we propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets, to bootstrap fine-tuning on human action video data. Through simulated robotic tasks, we demonstrate that the resulting visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach for semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions, although still out-of-domain, can significantly improve performance due to close alignment with robotic tasks. These findings show that reliance on annotated or robot-specific action datasets can be reduced, and they point to the potential of building on existing visual encoders to accelerate training and improve generalizability.
