Modality Prompts for Arbitrary Modality Salient Object Detection

Nianchang Huang, Yang Yang, Qiang Zhang, Jungong Han, Jin Huang

TL;DR

This work tackles arbitrary modality salient object detection (AM SOD) by introducing a modality-adaptive Transformer (MAT) that handles both diverse modality types and a varying number of input modalities. It couples a modality-adaptive feature extractor (MAFE) with a modality translation contractive (MTC) loss to learn modality prompts that align the feature space to each input modality, enabling discriminative unimodal features without expanding parameters. For fusion, a channel-wise and spatial-wise fusion hybrid (CSFH) strategy combines a spatial-wise dynamic fusion module (SDFM) and a channel-wise dynamic fusion module (CDFM) to extract cross-modal complementary information at different feature levels. Evaluated on the AM-XD dataset, MAT demonstrates strong sole- and joint-mode performance, with ablations confirming that modality prompts, the MTC loss, and CSFH each improve AM SOD by capturing both spatial-wise and channel-wise cross-modal relationships.

Abstract

This paper delves into the task of arbitrary modality salient object detection (AM SOD), which aims to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to address two fundamental challenges of AM SOD, i.e., more diverse modality discrepancies caused by the varying modality types that need to be processed, and dynamic fusion design caused by the uncertain number of modalities present in the inputs of the multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models with the characteristics of downstream tasks by learning some prompts, MAT first presents a modality-adaptive feature extractor (MAFE) that tackles the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss is further designed to assist MAFE in learning modality-distinguishable modality prompts. Accordingly, in the testing stage, MAFE can employ the learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, and is thus able to extract discriminative unimodal features. Then, MAT presents a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. To that end, CSFH employs a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fuse the unimodal features from varying numbers of modalities while effectively capturing cross-modal complementary semantic and detail information, respectively. Moreover, CSFH carefully aligns CDFM and SDFM to different levels of unimodal features based on their characteristics for more effective exploitation of complementary information.
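The "dynamic fusion" requirement in the abstract means the fusion step must accept one, two, or three unimodal feature maps without changing its structure. The sketch below illustrates that property only; it is not the paper's CDFM, whose internals are not given here. It assumes a simple scheme in which each modality's channel is weighted by a softmax over globally pooled responses, and the names `fuse_channelwise` and `channel_descriptor`, as well as the toy feature layout `[channels][positions]`, are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def channel_descriptor(feature):
    """Global average pooling: [channels][positions] -> one value per channel."""
    return [sum(channel) / len(channel) for channel in feature]


def fuse_channelwise(features):
    """Fuse any number (>= 1) of unimodal feature maps channel by channel.

    For each channel, the modalities are weighted by a softmax over their
    pooled responses, so the weights sum to 1 however many modalities
    are present. This is an assumed weighting rule, chosen for illustration.
    """
    if not features:
        raise ValueError("need at least one modality")
    descriptors = [channel_descriptor(f) for f in features]
    n_channels = len(features[0])
    n_positions = len(features[0][0])
    fused = []
    for c in range(n_channels):
        weights = softmax([d[c] for d in descriptors])
        fused.append([
            sum(w * f[c][p] for w, f in zip(weights, features))
            for p in range(n_positions)
        ])
    return fused


# Toy features: 2 channels, 2 spatial positions per modality.
rgb = [[1.0, 2.0], [0.0, 4.0]]
depth = [[3.0, 0.0], [1.0, 1.0]]
thermal = [[0.5, 0.5], [2.0, 2.0]]

fused_one = fuse_channelwise([rgb])                   # RGB only
fused_two = fuse_channelwise([rgb, depth])            # RGB-D
fused_three = fuse_channelwise([rgb, depth, thermal]) # RGB-D-T
```

The same function handles all three input configurations, and with a single modality the weights collapse to 1 so the input passes through unchanged.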

Paper Structure (27 sections, 18 equations, 7 figures, 4 tables)

This paper contains 27 sections, 18 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparisons of different SOD tasks. (a) Single-modal RGB SOD. (b) Two-modal RGB-D/RGB-T SOD. (c) Three-modal RGB-D-T SOD. (d) AM SOD.
  • Figure 2: Limitations of MSN. (a) Modality indicators for feature extraction. (b) Modality prompts for feature extraction.
  • Figure 3: Framework of our proposed modality-adaptive Transformer (MAT), using three-modality RGB-D-T inputs as an illustrative example. First, the modality-adaptive feature extractor (MAFE) receives an arbitrary modality image along with its corresponding modality prompt and extracts four levels of unimodal features. After the unimodal features of the RGB, depth, and thermal modalities are obtained, the channel-wise and spatial-wise fusion hybrid (CSFH) strategy fuses them by aligning SDFM and CDFM to different levels of unimodal features. Finally, MAT leverages a saliency decoder to predict salient objects from the fused features.
  • Figure 4: Diagram of our proposed MTC loss, taking RGB and thermal images as an example. The features extracted from an RGB image with an RGB prompt should have a different distribution from the features extracted from a thermal image with a thermal prompt, but share a similar distribution with the features extracted from a thermal image with an RGB prompt.
  • Figure 5: Architecture of our proposed SDFM.
  • ...and 2 more figures
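
Figure 4 describes a contrastive objective: an anchor feature, a feature it should resemble, and a feature it should differ from. A minimal sketch of that pattern follows, using a generic triplet margin loss rather than the paper's actual MTC formulation, which is not given here. The assignment of roles (anchor: RGB image with RGB prompt; positive: thermal image with RGB prompt; negative: thermal image with thermal prompt) is an assumption about the intended reading of the caption, and the feature vectors and function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    return max(
        0.0,
        l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin,
    )


# Hypothetical pooled features; the role assignment below is an assumption.
rgb_img_rgb_prompt = [0.0, 0.0]          # anchor
thermal_img_rgb_prompt = [0.0, 0.0]      # assumed positive (same prompt)
thermal_img_thermal_prompt = [3.0, 4.0]  # assumed negative (different prompt)

well_separated = triplet_margin_loss(
    rgb_img_rgb_prompt, thermal_img_rgb_prompt, thermal_img_thermal_prompt
)
# Swapping positive and negative yields a large penalty instead.
violated = triplet_margin_loss(
    rgb_img_rgb_prompt, thermal_img_thermal_prompt, thermal_img_rgb_prompt
)
```

Under these toy values the first call incurs no penalty because the negative already lies well beyond the margin, while the swapped call is penalized.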