CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion

Chih-Chung Hsu, Chih-Chien Ni, Chia-Ming Lee, Li-Wei Kang

TL;DR

This paper tackles efficient LR-HSI/HR-MSI fusion to produce HR-HSI on resource-constrained devices. It introduces CSAKD, a knowledge-distillation framework built around a Dual Two-Streamed (DTS) backbone, a Cross-Self-Attention (CSA) fusion module, and Cross-Layer Residual Aggregation (CLRA) blocks that jointly extract spatial and spectral features from LR-HSI and HR-MSI. A set of specialized losses, including a Band-Energy-Balance-Aware (BEBA) loss, a Spectral Angle Mapper (SAM) loss, and a response-based KD term, guides a lightweight student to approach the performance of a strong teacher. The approach yields comparable or superior fusion quality with significantly reduced model size and computation, demonstrates robustness under noise, and provides a practical path toward real-time hyperspectral fusion on limited-hardware platforms. The code is available on GitHub.
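To make the SAM term concrete, here is a minimal sketch of the standard spectral-angle computation: the angle between a predicted and a reference spectrum, averaged over pixels. It uses plain Python lists rather than tensors, and the function names and the `eps` stabilizer are illustrative assumptions, not taken from the CSAKD code base.

```python
import math

def spectral_angle(pred, target, eps=1e-8):
    """Angle in radians between two spectra (sequences of band values)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # eps (an assumed stabilizer) avoids division by zero for all-zero spectra.
    cos = dot / (norm_p * norm_t + eps)
    # Clamp to the valid acos domain to absorb floating-point rounding.
    cos = max(-1.0, min(1.0, cos))
    return math.acos(cos)

def sam_loss(pred_pixels, target_pixels):
    """Mean spectral angle over paired lists of per-pixel spectra."""
    angles = [spectral_angle(p, t) for p, t in zip(pred_pixels, target_pixels)]
    return sum(angles) / len(angles)
```

Because the angle depends only on the direction of each spectrum, a prediction that is a scaled copy of the reference gives an angle near zero. This is why SAM is usually paired with an intensity-sensitive term.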

Abstract

Hyperspectral imaging, which captures detailed spectral information for each pixel, is pivotal in diverse scientific and industrial applications. Yet the acquisition of high-resolution (HR) hyperspectral images (HSIs) is often hindered by the hardware limitations of existing imaging systems. A prevalent workaround involves capturing both a high-resolution multispectral image (HR-MSI) and a low-resolution (LR) HSI, then fusing them to yield the desired HR-HSI. Although deep learning-based methods have shown promise in HR-MSI/LR-HSI fusion and LR-HSI super-resolution (SR), their substantial model complexity hinders deployment on resource-constrained imaging devices. This paper introduces a novel knowledge distillation (KD) framework for HR-MSI/LR-HSI fusion to achieve SR of LR-HSI. Our KD framework uses the proposed Cross-Layer Residual Aggregation (CLRA) block to efficiently construct a Dual Two-Streamed (DTS) network structure, designed to extract joint and distinct features from LR-HSI and HR-MSI simultaneously. To fully exploit the spatial and spectral feature representations of LR-HSI and HR-MSI, we propose a novel Cross Self-Attention (CSA) fusion module that adaptively fuses those features to improve the spatial and spectral quality of the reconstructed HR-HSI. Finally, the proposed KD-based joint loss function is employed to co-train the teacher and student networks. Our experimental results demonstrate that the student model not only achieves comparable or superior LR-HSI SR performance but also significantly reduces model size and computational requirements. This marks a substantial advancement over existing state-of-the-art methods. The source code is available at https://github.com/ming053l/CSAKD.
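As a rough illustration of what a response-based KD term looks like, the sketch below adds a penalty on the gap between student and teacher outputs to an ordinary reconstruction loss. The L1 distance, the weight `alpha`, and all function names are assumptions made for this example. The paper's actual joint loss also involves the BEBA and SAM terms and co-trains both networks, none of which is reproduced here.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def joint_kd_loss(student_out, teacher_out, ground_truth, alpha=0.5):
    """Reconstruction loss plus a response-based distillation penalty.

    alpha is a hypothetical weight balancing the two terms.
    """
    # Task term: how far the student's reconstruction is from the reference.
    task = l1_distance(student_out, ground_truth)
    # Response-based KD term: how far the student's output is from the teacher's.
    distill = l1_distance(student_out, teacher_out)
    return task + alpha * distill
```

For example, `joint_kd_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5)` evaluates to `2.5`: a task term of 2.0 plus half of a distillation term of 1.0.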

Paper Structure (24 sections, 16 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: A brief illustration of the proposed CSAKD framework, which adaptively fuses the features of the LR-HSI and HR-MSI.
  • Figure 2: The proposed network architecture for HSI/MSI fusion, based on the proposed Cross-Layer Residual Aggregation (CLRA) unit and Cross-Self-Attention (CSA) fusion module. With the proposed Dual Two-Streamed (DTS) network, the model learns spatial-spectral representations across different branches. CSA then enables the network to fuse these representations adaptively. Through the proposed Knowledge Distillation (KD) scheme, the network not only maintains strong performance but also reduces model size to fit real-world scenarios.
  • Figure 3: The proposed Cross Self-Attention (CSA) fusion module. The blue cube contains high-spatial information, and the other two contain relatively rich spectral information. The attention module weights the different branches and fuses their representations.
  • Figure 4: Overview of the proposed CLRA. Each CLRA contains three CLRBs and a residual connection. Each CLRB stacks several convolution operators together with LeakyReLU activations and dense and residual connections.
  • Figure 5: Hyperspectral and multispectral fusion results on the AVIRIS dataset. The upper row shows the fused RGB image, and the lower row shows the residual image relative to the ground truth: (a) ground truth; (b) proposed method; (c) PZRes-Net [24]; (d) MSSJFL [23]; (e) Dual-UNet [25]; (f) DHIF-Net [26].
  • ...and 7 more figures