Towards Fusing Point Cloud and Visual Representations for Imitation Learning

Atalay Donat, Xiaogang Jia, Xi Huang, Aleksandar Taranovic, Denis Blessing, Ge Li, Hongyi Zhou, Hanyi Zhang, Rudolf Lioutikov, Gerhard Neumann

TL;DR

This work tackles multimodal imitation learning for robotic manipulation by fusing RGB images and point clouds within a diffusion-based policy. FPV-Net encodes images with a FiLM-ResNet and point clouds with FPS/KNN-based patch encoding, then fuses the two through adaptive layer normalization (AdaLN) conditioning. Experiments on RoboCasa show that neither modality alone suffices across tasks and that AdaLN-based fusion yields state-of-the-art performance. The strongest configuration uses point cloud and language tokens as the main input (PC+L) with RGB as the conditioning signal, and it improves further when local RGB features and language context are included. The approach demonstrates the value of cross-modal conditioning, which preserves geometric detail while exploiting rich semantic information, and opens avenues for more sophisticated cross-modal fusion strategies in real-world manipulation tasks.

Abstract

Learning for manipulation requires using policies that have access to rich sensory information such as point clouds or RGB images. Point clouds efficiently capture geometric structures, making them essential for manipulation tasks in imitation learning. In contrast, RGB images provide rich texture and semantic information that can be crucial for certain tasks. Existing approaches for fusing both modalities assign 2D image features to point clouds. However, such approaches often lose global contextual information from the original images. In this work, we propose FPV-Net, a novel imitation learning method that effectively combines the strengths of both point cloud and RGB modalities. Our method conditions the point-cloud encoder on global and local image tokens using adaptive layer norm conditioning, leveraging the beneficial properties of both modalities. Through extensive experiments on the challenging RoboCasa benchmark, we demonstrate the limitations of relying on either modality alone and show that our method achieves state-of-the-art performance across all tasks.
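
To make the adaptive layer norm conditioning mentioned in the abstract concrete, the following is a minimal sketch in plain Python. The page does not give the exact formulation used in FPV-Net, so the sketch assumes the common scale-and-shift form of AdaLN, in which a condition vector predicts a per-feature scale and shift applied to a normalized token. All names, dimensions, and weights here are hypothetical and chosen only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(w, b, c):
    """Apply y = W c + b, with W given as a list of rows."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi for row, bi in zip(w, b)]

def adaln(token, cond, w_scale, b_scale, w_shift, b_shift):
    """Modulate a normalized token with a scale and shift predicted from cond."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]

# Hypothetical toy values: a 4-dim point-cloud token and a 2-dim image condition vector.
pc_token = [1.0, 2.0, 3.0, 4.0]
rgb_cls = [0.5, -1.0]
zeros_w = [[0.0, 0.0] for _ in range(4)]
zeros_b = [0.0] * 4

# With zero modulation weights, AdaLN reduces to plain layer norm.
out = adaln(pc_token, rgb_cls, zeros_w, zeros_b, zeros_w, zeros_b)
```

In this form the conditioning modality never enters the token sequence itself; it only modulates how the main tokens are normalized, which is one way to inject image context without overwriting the geometric features.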

Paper Structure

This paper contains 28 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Processing each input modality to generate corresponding embeddings. Top: A FiLM-ResNet architecture is used to extract a feature map from the context image. The feature map is processed through average pooling and flattening to obtain global and local feature tokens, which are then concatenated and fed into the transformer along with a learnable CLS token, whose output is used as a condition vector for the diffusion policy (Figure 2). Middle: The point cloud input is processed by applying FPS to sample points, followed by KNN to group point patches using these FPS points as centers. The resulting patches are passed through a point-patch encoder, which can be a lightweight MLP or the pretrained SUGAR model. Bottom: The CLIP model is employed to generate the language embedding for the behavior prompt.
  • Figure 2: Conditioned on image CLS tokens, the transformer-based diffusion policy (DiT block) denoises action chunk tokens by utilizing 3D point cloud tokens and language tokens as inputs. The conditioning process is detailed within the structure of the DiT block.
  • Figure 3: Example scenarios from the RoboCasa benchmark (robocasa2024) used in our experiments.
  • Figure 4: Success rates using different fusion types for point cloud and RGB images.
  • Figure 5: Success rates when using max pooling or a transformer to obtain the global feature vector of the RGB images used in AdaLN conditioning.
  • ...and 2 more figures
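
The point-cloud branch described in the Figure 1 caption (FPS to pick centers, then KNN to group patches around them) can be sketched as follows. This is a minimal, unoptimized illustration in plain Python and not the authors' implementation; the function names, the toy cloud, and the choice of starting index are assumptions made for the example.

```python
import math

def farthest_point_sampling(points, num_centers, start=0):
    """Greedily pick indices so each new center is farthest from those chosen so far."""
    chosen = [start]
    # Distance from every point to its nearest chosen center.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen

def knn_patches(points, center_ids, k):
    """For each center, return indices of its k nearest points (the center included)."""
    patches = []
    for c in center_ids:
        order = sorted(range(len(points)), key=lambda i: math.dist(points[i], points[c]))
        patches.append(order[:k])
    return patches

# Hypothetical toy cloud: two well-separated clusters of three points each.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0)]
centers = farthest_point_sampling(cloud, num_centers=2)
patches = knn_patches(cloud, centers, k=3)
```

On this toy cloud the two sampled centers fall in different clusters, and each patch collects the three points of its own cluster. In the pipeline described by the caption, each such patch would then be passed to the point-patch encoder to produce one token.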