Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images

Bo Yuan, Danpei Zhao, Zhuoran Liu, Wentao Li, Tian Li

TL;DR

Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model and show that joint optimization can boost sub-task CL efficiency, with over 13% relative improvement in panoptic quality.

Abstract

Continual learning (CL) breaks with the one-off training paradigm and enables a model to adapt continuously to new data, semantics and tasks. However, current CL methods mainly focus on single tasks. Moreover, CL models suffer from catastrophic forgetting and semantic drift because old data are unavailable, a problem that is common in remote sensing interpretation owing to its intricate fine-grained semantics. In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception for universal interpretation of remote sensing images. Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously. To inherit the knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting. Furthermore, we propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception. Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model and show that joint optimization can boost sub-task CL efficiency, with over 13% relative improvement in panoptic quality.
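
To make the exemplar-free idea in the abstract more concrete, the sketch below shows generic confidence-thresholded pseudo-labeling as commonly used in continual segmentation: pixels that the new-step annotation leaves as background are filled with the old model's prediction when that prediction is confident enough. This is not the paper's task-asymmetric pseudo-labeling (TPL), which additionally cross-verifies labels across the segmentation and captioning branches. The function name `merge_pseudo_labels`, the `BACKGROUND` id and the `threshold` value are assumptions made for illustration.

```python
from typing import List

BACKGROUND = 0  # assumed label id for pixels not annotated in the current step


def merge_pseudo_labels(
    new_gt: List[List[int]],
    old_pred: List[List[int]],
    old_conf: List[List[float]],
    threshold: float = 0.7,  # hypothetical confidence cut-off
) -> List[List[int]]:
    """Fill unannotated pixels with confident old-model predictions.

    new_gt   : current-step labels (only new classes are annotated)
    old_pred : class ids predicted by the frozen old model
    old_conf : confidence of each old-model prediction, in [0, 1]
    """
    merged = []
    for gt_row, pred_row, conf_row in zip(new_gt, old_pred, old_conf):
        row = []
        for gt, pred, conf in zip(gt_row, pred_row, conf_row):
            # Ground truth for new classes always wins; the old model only
            # fills in background pixels, and only when it is confident.
            if gt == BACKGROUND and pred != BACKGROUND and conf >= threshold:
                row.append(pred)
            else:
                row.append(gt)
        merged.append(row)
    return merged


if __name__ == "__main__":
    new_gt = [[0, 0, 3], [0, 3, 3]]       # class 3 is the new class
    old_pred = [[1, 2, 1], [0, 1, 2]]     # old model only knows classes 1 and 2
    old_conf = [[0.9, 0.4, 0.95], [0.8, 0.9, 0.99]]
    print(merge_pseudo_labels(new_gt, old_pred, old_conf))
```

Training on the merged labels lets the new model keep seeing supervision for old classes even though no old images or annotations are stored.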

Paper Structure (27 sections, 13 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: The proposed Continual Panoptic Perception (CPP) architecture. (a): Single-task CL methods only support separate training on different tasks. (b): CPP enables a shared encoder across multi-modal tasks, which also supports multi-task continual learning within a single model. (c): CPP achieves class-incremental pixel classification, instance segmentation and image captioning.
  • Figure 2: Illustration of the proposed CPP network. The input consists of the incremental images with corresponding mask annotation and specific text-format annotation. The output consists of the mask predictions for both old and new classes and image captioning result with new semantics.
  • Figure 3: Task-asymmetric pseudo-labeling. The asymmetric task reliance indicates the pseudo labels are cross-verified by more reliable predictions from multi-modal branches.
  • Figure 4: Qualitative visualization of the CPP before and after CL steps. The predictions are updated after CL steps on segmentation and captioning synchronously.
  • Figure 5: Comparison of PQ, SQ and RQ on all learned classes after all CL steps with different backbones on the 15-5 task.
  • ...and 1 more figure
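
The PQ, SQ and RQ metrics named in the Figure 5 caption follow the standard panoptic quality definition: SQ is the mean IoU over matched segments, RQ is an F1-style detection score, and PQ is their product. The sketch below computes them for a single class and also shows how a relative improvement figure is obtained. The function names and all numbers are hypothetical and are not taken from the paper's results.

```python
from typing import List, Tuple


def panoptic_quality(
    matched_ious: List[float], num_fp: int, num_fn: int
) -> Tuple[float, float, float]:
    """Return (PQ, SQ, RQ) for one class.

    matched_ious : IoU of each true-positive match (prediction/ground-truth
                   pairs with IoU > 0.5, as in the standard definition)
    num_fp       : predicted segments with no match
    num_fn       : ground-truth segments with no match
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0  # segmentation quality
    rq = tp / denom                             # recognition quality
    return sq * rq, sq, rq                      # PQ = SQ * RQ


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Illustrative values only.
    pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
    print(f"PQ={pq:.3f} SQ={sq:.3f} RQ={rq:.3f}")
    print(f"relative gain: {relative_improvement(40.0, 46.0):.1%}")
```

A relative improvement is measured against the baseline score, so a gain from 40.0 to 46.0 PQ is 15% relative but only 6 points absolute.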