BihoT: A Large-Scale Dataset and Benchmark for Hyperspectral Camouflaged Object Tracking

Hanzheng Wang, Wei Li, Xiang-Gen Xia, Qian Du

TL;DR

The paper addresses Hyperspectral Camouflaged Object Tracking (HCOT) by introducing BihoT, a large-scale dataset of $41{,}912$ hyperspectral images with $25$ bands each, spread across $49$ video sequences and designed so that trackers must rely on spectral discrimination rather than visual cues. It proposes SPDAN, a baseline that combines a Spectral Embedding Network (SEN), a Spectral Prompt-based Backbone Network (SPBN) with cross-modality adapters, and a Distractor-aware Module (DAM) that handles occlusions and background distractors; the visual transformer backbone is kept frozen for efficiency. Experiments show that SPDAN achieves state-of-the-art performance on BihoT and on existing HOT datasets, ablations confirm the contribution of SEN and DAM, and cross-dataset tests indicate good generalization. The work highlights the practical value of spectral information in camouflaged-tracking scenarios and identifies temporal dynamics as a direction for future gains.

Abstract

Hyperspectral object tracking (HOT) has exhibited potential in various applications, particularly in scenes where objects are camouflaged. Existing trackers can effectively retrieve objects via band regrouping because of the bias in existing HOT datasets, where most objects tend to have distinguishing visual appearances rather than spectral characteristics. This bias allows the tracker to directly use the visual features obtained from the false-color images generated by hyperspectral images without the need to extract spectral features. To tackle this bias, we find that the tracker should focus on the spectral information when object appearance is unreliable. Thus, we provide a new task called hyperspectral camouflaged object tracking (HCOT) and meticulously construct a large-scale HCOT dataset, termed BihoT, which consists of 41,912 hyperspectral images covering 49 video sequences. The dataset covers various artificial camouflage scenes where objects have similar appearances, diverse spectrums, and frequent occlusion, making it a very challenging dataset for HCOT. Besides, a simple but effective baseline model, named spectral prompt-based distractor-aware network (SPDAN), is proposed, comprising a spectral embedding network (SEN), a spectral prompt-based backbone network (SPBN), and a distractor-aware module (DAM). Specifically, the SEN extracts spectral-spatial features via 3-D and 2-D convolutions. Then, the SPBN fine-tunes powerful RGB trackers with spectral prompts and alleviates the insufficiency of training samples. Moreover, the DAM utilizes a novel statistic to capture the distractor caused by occlusion from objects and background. Extensive experiments demonstrate that our proposed SPDAN achieves state-of-the-art performance on the proposed BihoT and other HOT datasets.
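
To make the point about false-color images concrete, here is a minimal sketch of how band regrouping can hide spectral differences. It assumes one simple regrouping scheme (averaging contiguous bands into three channels), which the abstract does not specify. The function name `regroup_bands` and the toy spectra are hypothetical and are not taken from the paper or from BihoT.

```python
from typing import List, Sequence


def regroup_bands(spectrum: Sequence[float], groups: int = 3) -> List[float]:
    """Collapse one pixel's spectrum into `groups` channels.

    Assumed scheme (not from the paper): split the bands into contiguous
    groups of near-equal size and average each group.
    """
    n = len(spectrum)
    if groups <= 0 or n < groups:
        raise ValueError("need at least one band per group")
    base, extra = divmod(n, groups)  # 25 bands, 3 groups -> sizes 9, 8, 8
    channels = []
    start = 0
    for g in range(groups):
        size = base + (1 if g < extra else 0)
        chunk = spectrum[start:start + size]
        channels.append(sum(chunk) / size)
        start += size
    return channels


# Two toy 25-band spectra: one flat, one oscillating from band to band.
flat = [1.0] * 25
textured = [0.5, 1.5] * 4 + [1.0] + [0.5, 1.5] * 8

# The spectra differ band by band, yet both collapse to the same
# false-color pixel under this regrouping.
print(regroup_bands(flat))      # [1.0, 1.0, 1.0]
print(regroup_bands(textured))  # [1.0, 1.0, 1.0]
print(flat != textured)         # True
```

In this toy case a tracker that only sees the regrouped pixel cannot separate the two materials, whereas the full 25-band spectra remain clearly distinct. That is the situation HCOT is meant to expose.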

Paper Structure (21 sections, 17 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: Differences between the BihoT dataset and the HOTC-2020 dataset. The object in the green box is a real kiwi, while the object in the red box is a fake kiwi, considered a camouflaged object. Data value refers to the value of a pixel, representing the intensity of the spectral reflectance curve.
  • Figure 2: Illustration of the proposed BihoT dataset. (a) Examples of spectral distinguishable (s-dis) factors from the BihoT dataset. (b) Examples of false-color images of the Kiwifruit3, Chill2, and Lemon2 video sequences from the BihoT dataset.
  • Figure 3: Illustration of the overall structure of our proposed SPDAN, including the spectral prompt-based backbone network (SPBN) and distractor-aware module (DAM). Specifically, SPBN contains four main modules, i.e., spectral embedding network (SEN), cross-modality adapter (CA), visual Transformer backbone (VTB), and head network (HN).
  • Figure 4: Illustration of the structure of CA.
  • Figure 5: Visualization of the decision confidence (DC) and the corresponding classification map (CM) of the basketball video sequence on the HOTC-2020 dataset. The line graph above represents the change in DC for each frame. It can be observed that when DC is below the threshold (i.e., frame #0055), multiple local extreme points in the CM appear, and the tracking results become unreliable. OD denotes the original images.
  • ...and 4 more figures
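
The Figure 5 caption notes that unreliable frames show multiple local extreme points in the classification map. As a rough illustration of that symptom only, the sketch below counts strong local maxima in a small 2-D score map. It is not the paper's decision-confidence statistic, whose definition is not given here; `count_peaks`, the `rel_height` parameter, and both toy maps are hypothetical, and scores are assumed to be non-negative.

```python
from typing import List


def count_peaks(score_map: List[List[float]], rel_height: float = 0.5) -> int:
    """Count strict local maxima that reach `rel_height` times the global peak.

    A cell is a peak if it is strictly greater than all of its
    8-connected neighbours (fewer at the borders).
    """
    rows, cols = len(score_map), len(score_map[0])
    global_max = max(max(row) for row in score_map)
    peaks = 0
    for r in range(rows):
        for c in range(cols):
            value = score_map[r][c]
            if value < rel_height * global_max:
                continue  # too weak to compete with the main response
            neighbours = [
                score_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks += 1
    return peaks


# A clean response: one dominant peak at the centre.
clean = [
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.0],
]

# An ambiguous response: two comparable peaks, e.g. target and distractor.
ambiguous = [
    [0.8, 0.1, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.7],
    [0.0, 0.0, 0.1, 0.1],
]

print(count_peaks(clean))      # 1
print(count_peaks(ambiguous))  # 2
```

A count above one flags a frame where a distractor competes with the target, which matches the qualitative behaviour the caption describes for low-confidence frames.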