Salient Object Detection From Arbitrary Modalities

Nianchang Huang, Yang Yang, Ruida Xi, Qiang Zhang, Jungong Han, Jin Huang

TL;DR

The paper tackles Salient Object Detection under Arbitrary Modality (AM SOD), enabling a single model to process inputs with varying modality types and quantities. It introduces the Modality Switch Network (MSN), comprising a Modality Switch Feature Extractor (MSFE) and a Dynamic Fusion Module (DFM) to adaptively extract unimodal features and fuse them through Transformer-inspired cross-modal attention, followed by a Saliency Prediction Decoder (SPD). A novel AM-XD dataset is built to evaluate sole and joint modality settings across RGB, RGB-D, RGB-T, and RGB-D-T inputs. The results show that MSN achieves strong generalization across modalities, with ablations confirming the effectiveness of MSFE and DFM, suggesting significant practical value for flexible multimodal saliency tasks.
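To make the idea of fusing a variable number of modalities concrete, here is a minimal sketch of attention-based fusion. It is not the paper's DFM: the function name `fuse_modalities`, the mean-pooled query, and the absence of learned projections are all assumptions made for illustration. It only shows how one attention code path can accept one, two, or three modality feature sets.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_modalities(features):
    """Fuse per-modality token features, each of shape (N, C).

    Hypothetical sketch: the query is the mean of the available modalities,
    and keys/values are the tokens of all available modalities concatenated.
    A real model would add learned query/key/value projections.
    """
    if not features:
        raise ValueError("at least one modality is required")
    stack = list(features.values())
    query = np.mean(stack, axis=0)                # (N, C)
    context = np.concatenate(stack, axis=0)       # (M * N, C), M = number of modalities
    scale = np.sqrt(query.shape[-1])
    attn = softmax(query @ context.T / scale)     # (N, M * N), rows sum to 1
    return attn @ context                         # (N, C) regardless of M


rng = np.random.default_rng(0)
rgb, depth, thermal = (rng.normal(size=(4, 8)) for _ in range(3))

# The same function handles RGB, RGB-D, and RGB-D-T style inputs.
for inputs in ({"rgb": rgb},
               {"rgb": rgb, "depth": depth},
               {"rgb": rgb, "depth": depth, "thermal": thermal}):
    print(sorted(inputs), fuse_modalities(inputs).shape)
```

Because the attention context is built by concatenation, the output shape depends only on the token count and channel width, not on how many modalities were supplied.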

Abstract

In many real-life applications, the types and numbers of inputs to a salient object detection (SOD) algorithm may change dynamically. However, existing SOD algorithms are mainly designed or trained for one particular type of input and fail to generalize to other types. Consequently, a separate SOD algorithm must be prepared in advance for each type of input, which raises hardware and research costs considerably. In contrast, this paper proposes a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristics of AM SOD are that the modality types and the modality numbers are arbitrary or change dynamically. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depth, or any combination of them. The latter means that the number of input modalities changes with the input type, e.g., a single-modality RGB image, dual-modality RGB-Depth (RGB-D) images, or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to these challenges, i.e., a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing modality indicators, which generate weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection.
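The abstract says only that modality indicators generate weights for modality switching. One possible reading, sketched below, is a one-hot indicator mapped to channel-wise gates that modulate features from a shared extractor. The class name `ModalitySwitch`, the one-hot encoding, the sigmoid gate, and the random initialisation are all assumptions for illustration, not the paper's MSFE.

```python
import numpy as np

MODALITIES = ("rgb", "depth", "thermal")


class ModalitySwitch:
    """Hypothetical indicator-driven channel gating over shared features."""

    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        # One row per modality: maps a one-hot indicator to channel logits.
        # In a trained model these would be learned parameters.
        self.weight = rng.normal(scale=0.1, size=(len(MODALITIES), channels))

    def switch_weights(self, modality):
        """Return channel-wise gates in (0, 1) for the given modality."""
        indicator = np.zeros(len(MODALITIES))
        indicator[MODALITIES.index(modality)] = 1.0
        logits = indicator @ self.weight           # (C,)
        return 1.0 / (1.0 + np.exp(-logits))       # sigmoid

    def __call__(self, shared_features, modality):
        """Modulate shared features of shape (N, C) with the modality's gates."""
        return shared_features * self.switch_weights(modality)


switch = ModalitySwitch(channels=8)
tokens = np.ones((4, 8))  # stand-in for features from a shared extractor

for name in MODALITIES:
    print(name, switch(tokens, name).shape)
```

The point of the design is that the extractor weights stay shared, and only the small indicator-to-gate mapping differs per modality, so adding or dropping an input does not require a separate network.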

Paper Structure

This paper contains 32 sections, 20 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: RGB SOD, RGB-D/T SOD vs AM SOD. (a) Existing SOD models. (b) Devices with multiple cameras. (c) AM SOD. The modality types and numbers for the inputs of existing SOD models must be fixed, while the modality types and numbers for our proposed AM SOD model may be arbitrary or changed.
  • Figure 2: Framework of our proposed MSN. The input images are first fed into the switch feature extractor to extract their unimodal features. The number of input images is arbitrary; for clarity, images of all modalities are displayed, but the networks and features drawn with dashed lines are optional and may or may not be present. The unimodal features are then fused by the dynamic fusion module. Finally, the saliency maps are obtained from the saliency prediction decoder.
  • Figure 3: Framework of our proposed DFM.
  • Figure 4: Basic structure of the deconvolutional block.
  • Figure 5: Data structures and evaluation modes of our proposed AM-XD dataset. For clarity, only the connection lines of the RGB SOD, RGB-D SOD, and RGB-D-T SOD settings are drawn for the sole mode and the joint mode.
  • ...and 2 more figures