Towards Fusing Point Cloud and Visual Representations for Imitation Learning
Atalay Donat, Xiaogang Jia, Xi Huang, Aleksandar Taranovic, Denis Blessing, Ge Li, Hongyi Zhou, Hanyi Zhang, Rudolf Lioutikov, Gerhard Neumann
TL;DR
This work tackles multimodal imitation learning for robotic manipulation by fusing RGB images and point clouds within a diffusion-based policy. FPV-Net encodes images with a FiLM-ResNet and point clouds with an encoder based on farthest point sampling (FPS) and k-nearest-neighbor (KNN) grouping, then fuses the two through adaptive layer normalization (AdaLN) conditioning; the strongest configuration uses RGB as the conditioning input. Extensive RoboCasa experiments show that neither modality alone suffices across tasks and that AdaLN-based fusion (especially PC+L main with RGB conditioning) yields state-of-the-art performance, which improves further when local RGB features and language context are incorporated. The approach demonstrates the value of cross-modal conditioning, which preserves geometric detail while exploiting rich semantic information, and opens avenues for more sophisticated cross-modal fusion strategies in real-world manipulation tasks.
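The FPS/KNN step mentioned above can be illustrated with a minimal NumPy sketch. It shows only the generic technique (pick well-spread center points, then gather each center's nearest neighbors into a local patch); the function names `farthest_point_sampling` and `knn_groups`, the `start` index, and the group sizes are assumptions for illustration, not FPV-Net's actual encoder or hyperparameters.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be far apart.

    points: (n, 3) array. `start` is an assumed seed index for illustration.
    """
    chosen = [start]
    # Distance from every point to its nearest already-chosen center.
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest point from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def knn_groups(points, center_idx, m):
    """Return, for each center, the indices of its m nearest points."""
    centers = points[center_idx]                                      # (k, 3)
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)  # (k, n)
    return np.argsort(d, axis=1)[:, :m]                               # (k, m)
```

Each row returned by `knn_groups` indexes one local patch of the cloud; in encoders of this kind, such a patch is typically passed through a small network and pooled into a single token.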
Abstract
Learning manipulation skills requires policies with access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structure, which makes them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches to fusing the two modalities assign 2D image features to point clouds, but in doing so they often lose global contextual information from the original images. In this work, we propose FPV-Net, a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments on the challenging RoboCasa benchmark, we demonstrate the limitations of relying on either modality alone and show that our method achieves state-of-the-art performance across all tasks.
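To make the conditioning mechanism concrete, here is a minimal NumPy sketch of adaptive layer normalization in its common form: tokens are normalized without learned affine parameters, and a scale and shift are regressed from a conditioning vector. The abstract does not give the exact formulation used in FPV-Net, so the `1 + scale` parameterization, the single pooled conditioning vector `cond`, and the projection parameters `w` and `b` are all assumptions for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(tokens, cond, w, b):
    """Modulate normalized tokens with a scale/shift predicted from cond.

    tokens: (N, D) point-cloud tokens.
    cond:   (C,) conditioning vector, e.g. a pooled image feature (assumed).
    w, b:   (C, 2D) and (2D,) hypothetical linear-projection parameters.
    """
    params = cond @ w + b               # (2D,)
    scale, shift = np.split(params, 2)  # each (D,)
    return layer_norm(tokens) * (1.0 + scale) + shift
```

With `w` and `b` set to zero the layer reduces to plain layer normalization, so the conditioning signal acts as a learned perturbation of the point-cloud pathway and does not replace its features.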
