Distilling Textual Priors from LLM to Efficient Image Fusion

Ran Zhang, Xuanhua He, Ke Cao, Liu Liu, Li Zhang, Man Zhou, Jie Zhang

TL;DR

Multi-modality image fusion incurs high computational cost when it relies on text priors from large models. This work introduces a teacher–student framework that distills textual priors into a lightweight student, so that inference needs no text guidance. A spatial-channel cross-fusion module and a tailored distillation loss let the student emulate the teacher's degradation-aware fusion with far fewer parameters. Experiments on infrared-visible fusion (IVF) and medical fusion datasets show state-of-the-art performance with up to 90% parameter reduction and up to 98% faster inference, making deployment in real-world scenarios practical.

Abstract

Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as CNN- and GAN-based methods, are efficient but struggle with low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture in which the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing SOTA methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
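The abstract does not spell out the tailored distillation process, so the following is only a minimal sketch of a generic teacher-student objective, not the authors' loss. It assumes the student is trained to match both the teacher's fused output (mean absolute error) and one intermediate feature map (mean squared error). The function names and the weights `alpha` and `beta` are hypothetical, and images and features are represented as flat lists of floats to keep the example dependency-free.

```python
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized flat arrays."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two equally sized flat arrays."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    student_feat: Sequence[float],
    teacher_feat: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight on the output-matching term
    beta: float = 0.5,   # hypothetical weight on the feature-matching term
) -> float:
    """Generic distillation objective (an assumption, not the paper's loss).

    The teacher's tensors act as fixed targets. The teacher saw the text
    prior, so matching it lets the student absorb that prior without
    receiving any text input itself.
    """
    output_term = l1_distance(student_out, teacher_out)
    feature_term = mse(student_feat, teacher_feat)
    return alpha * output_term + beta * feature_term


if __name__ == "__main__":
    # Toy 2-pixel "images" and 1-element "features".
    print(distillation_loss([0.0, 1.0], [1.0, 1.0], [0.0], [2.0]))
```

In a real training loop the same structure would operate on tensors, with gradients flowing only through the student while the teacher stays frozen.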

Paper Structure

This paper contains 27 sections, 17 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of different image fusion methods and their parameter efficiency. (1) Traditional methods use small fusion networks. (2) Text-guided methods improve performance but significantly increase computational demands. (3) Our proposed method leverages text-guided training and knowledge distillation to create a distilled network that achieves high-quality fused images without relying on LLMs during inference.
  • Figure 2: Overview of our text-guided image fusion framework. The architecture consists of three main components: (1) Text Guidance Module that leverages LLMs and CLIP to generate semantic guidance; (2) Encoder that processes visible and infrared inputs through dual-stream transformers with TSAB and SSAB blocks, followed by cross-modal fusion; (3) Decoder and Refinement that progressively reconstructs the fused image with text-guided feature modulation. Gray components are removed in the distilled student network. TSAB: Transposed Self-Attention Block, SSAB: Spatial Self-Attention Block.
  • Figure 3: Qualitative comparison of different image fusion methods on a challenging scene (FLIR_05767.jpg) from the RoadScene dataset. Our method better preserves thermal information from infrared images while maintaining visible details and natural appearance, especially in scenarios with extreme lighting conditions or complex textures. For a more exhaustive visual comparison across various scenarios and methods, please refer to Fig. \ref{fig:ivf_compare_full}.
  • Figure 4: Comprehensive visual comparison of different image fusion methods on Harvard Medical Image Fusion Datasets (PET-MRI, CT-MRI, SPECT-MRI). For each set of results, from left to right: Modality 1 Input, Modality 2 Input, Text-IF Output, Ours-Teacher Output, Ours-Distilled Output.
  • Figure 5: Overview of our ablation study.
  • ...and 3 more figures
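
The Figure 2 caption names two attention blocks, TSAB (Transposed Self-Attention Block) and SSAB (Spatial Self-Attention Block), without detailing them. As a rough illustration of the difference the names suggest, the sketch below contrasts attention computed across spatial tokens with attention computed across channels. It is a simplification under stated assumptions: a single head, no learned query/key/value projections, and a plain `1/sqrt(d)` scaling. The paper's actual blocks may differ, and all function names here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_self_attention(x: Matrix) -> Matrix:
    """x is (tokens, channels); the attention map is tokens x tokens."""
    scale = 1.0 / math.sqrt(len(x[0]))
    scores = [[v * scale for v in row] for row in matmul(x, transpose(x))]
    return matmul(softmax_rows(scores), x)


def transposed_self_attention(x: Matrix) -> Matrix:
    """x is (tokens, channels); the attention map is channels x channels."""
    xt = transpose(x)  # (channels, tokens)
    scale = 1.0 / math.sqrt(len(x))
    scores = [[v * scale for v in row] for row in matmul(xt, x)]
    mixed = matmul(softmax_rows(scores), xt)  # (channels, tokens)
    return transpose(mixed)  # back to (tokens, channels)


if __name__ == "__main__":
    # Six spatial tokens (e.g. a flattened 2x3 feature map) with two channels.
    feats = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
    print(len(spatial_self_attention(feats)), len(transposed_self_attention(feats)[0]))
```

The practical difference is cost: the spatial variant builds a map whose size grows quadratically with the number of pixels, while the transposed variant builds a map whose size depends only on the channel count, which is why channel-wise attention is often preferred for high-resolution images.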