
CMIP-CIL: A Cross-Modal Benchmark for Image-Point Class Incremental Learning

Chao Qi, Jianqin Yin, Ren Zhang

TL;DR

This work addresses image-point class incremental learning (IP-CIL), in which a vision model learns new categories from 2D images and must classify them in 3D point clouds without catastrophically forgetting earlier classes. It introduces the CMIP-CIL benchmark and a contrastive image-point pre-training strategy that uses RRM, which renders multi-view images from randomly masked point clouds, to build robust image-point correspondences. An incremental phase follows, in which the backbone is frozen and adapters plus prototype regularization preserve prior knowledge. The framework reports state-of-the-art performance on multimodal ModelNet40 and ShapeNet55 benchmarks, with ablations supporting the importance of image-point alignment, prototype regularization, and the RRM data augmentation. The work is relevant to robotics, where perception must keep improving across modalities as new object categories appear in dynamic environments.

Abstract

Image-point class incremental learning helps robots with 3D point-cloud vision continually learn category knowledge from 2D images, improving their perceptual capability in dynamic environments. However, some incremental learning methods address unimodal forgetting but fail in cross-modal cases, while others handle modal differences within the training and testing datasets but assume no modal gap between them. We are the first to explore this cross-modal task, proposing the CMIP-CIL benchmark and alleviating cross-modal catastrophic forgetting. In pre-training, the method employs masked point clouds and rendered multi-view images within a contrastive learning framework, giving the vision model generalizable image-point correspondences. In the incremental stage, by freezing the backbone and pulling object representations toward their respective prototypes, the model retains and generalizes knowledge of previously seen categories while continuing to learn new ones. We conduct comprehensive experiments on the benchmark datasets. The results show that our method achieves state-of-the-art performance, outperforming the baseline methods by a large margin.
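
The contrastive pre-training idea in the abstract can be illustrated with a small sketch. The exact loss is not given in this summary, so the code below assumes a standard symmetric InfoNCE objective over paired image and point-cloud embeddings; the function names, the temperature value, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def image_point_contrastive_loss(image_emb, point_emb, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's objective).

    Row i of image_emb and row i of point_emb are taken to describe the same
    object; every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    pts = l2_normalize(point_emb)
    logits = img @ pts.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(img))
    loss_i2p = -log_softmax(logits)[idx, idx].mean()    # image -> point direction
    loss_p2i = -log_softmax(logits.T)[idx, idx].mean()  # point -> image direction
    return 0.5 * (loss_i2p + loss_p2i)

# Toy check: matched pairs should score a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = image_point_contrastive_loss(emb, emb)
shuffled_loss = image_point_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

In this toy setup the loss is small when each image embedding matches its own point embedding and large when the pairing is shifted by one, which is the behaviour a cross-modal alignment objective is meant to reward.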

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The IP-CIL task. Task 1: the model learns to classify objects from images and is tested on point-cloud classification. Task 2: the model learns new objects from images and is tested on point clouds of both the current and former classes. Subsequent tasks follow the same pattern.
  • Figure 2: Framework of the CMIP-CIL benchmark. Through image rendering with randomly masked points, image-point pairs $\{ z_i^j\} _{j = 1}^m \sim {\tilde{x}_i}$ are generated. Image-point contrastive (IPC) and intra-modal contrastive (IMC) learning narrow the gap between the image encoding ${\phi _I}({z_i})$ and the point encodings $\{ \phi _p^1({\tilde{x}_i}),\phi _p^2({\tilde{x}_i})\}$ of the same object. In CIL, novel encoders (with trainable layers) $\varphi _I^t( \cdot )$, $\varphi _P^t( \cdot )$ cooperate with the regularization term to tune the class prototypes in task $t$.
  • Figure 3: RRM illustration. Given a point cloud as input, randomly masked meshes are projected with a differentiable renderer to generate multi-view images.
  • Figure 4: The classification accuracy $\mathcal{A}_b$ at each incremental step with different methods on m-MN40-Inc.4
  • Figure 5: The classification accuracy $\mathcal{A}_b$ at each incremental step with different methods on m-MN40-Inc.8
  • ...and 2 more figures
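
The prototype regularization mentioned in the abstract and in the Figure 2 caption can be sketched as follows. The paper's exact regularizer is not given in this summary, so the code assumes class prototypes are per-class feature means, the regularization term is the mean squared distance between each feature and its class prototype, and inference picks the nearest prototype; all function names are hypothetical.

```python
import numpy as np

def class_prototypes(features, labels):
    """Return {class_id: mean feature vector} (assumed prototype definition)."""
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def prototype_regularization(features, labels, prototypes):
    """Mean squared distance between each feature and its own class prototype."""
    targets = np.stack([prototypes[int(c)] for c in labels])
    return float(((features - targets) ** 2).sum(axis=1).mean())

def nearest_prototype_predict(features, prototypes):
    """Assign each feature to the class whose prototype is closest."""
    class_ids = np.array(sorted(prototypes))
    protos = np.stack([prototypes[int(c)] for c in class_ids])
    # Pairwise squared distances, shape (num_samples, num_classes).
    dists = ((features[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return class_ids[dists.argmin(axis=1)]

# Toy example with two well-separated classes in a 2D feature space.
features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
prototypes = class_prototypes(features, labels)
reg = prototype_regularization(features, labels, prototypes)
pred = nearest_prototype_predict(features, prototypes)
```

In an incremental setting, prototypes of earlier classes would be stored and kept available while new ones are added per task, so a term of this kind discourages features from drifting away from what the classifier already knows.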