Distilling Textual Priors from LLM to Efficient Image Fusion

Ran Zhang; Xuanhua He; Ke Cao; Liu Liu; Li Zhang; Man Zhou; Jie Zhang

TL;DR

Text-guided multi-modality image fusion incurs high computational cost because it relies on priors from large models. This work introduces a teacher-student framework that distills textual priors into a lightweight student, so inference requires no text guidance. A spatial-channel cross-fusion module and a tailored distillation loss let the student emulate the teacher's degradation-aware fusion with far fewer parameters. Experiments on infrared-visible fusion (IVF) and medical fusion datasets show state-of-the-art performance with up to 90% parameter reduction and up to 98% faster inference, making deployment in real-world scenarios practical.

Abstract

Multi-modality image fusion aims to synthesize a single, comprehensive image from multiple source inputs. Traditional approaches, such as those based on CNNs and GANs, offer efficiency but struggle to handle low-quality or complex inputs. Recent text-guided methods leverage large model priors to overcome these limitations, but at the cost of significant computational overhead in both memory and inference time. To address this challenge, we propose a novel framework for distilling large model priors, eliminating the need for text guidance during inference while dramatically reducing model size. Our framework uses a teacher-student architecture, where the teacher network incorporates large model priors and transfers this knowledge to a smaller student network via a tailored distillation process. Additionally, we introduce a spatial-channel cross-fusion module to enhance the model's ability to leverage textual priors across both spatial and channel dimensions. Our method achieves a favorable trade-off between computational efficiency and fusion quality. The distilled network, requiring only 10% of the parameters and inference time of the teacher network, retains 90% of its performance and outperforms existing state-of-the-art (SOTA) methods. Extensive experiments demonstrate the effectiveness of our approach. The implementation will be made publicly available as an open-source resource.
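
The teacher-student transfer described above can be illustrated with a generic output-level distillation objective. This is a minimal sketch under stated assumptions, not the paper's tailored distillation loss, which the abstract does not specify. The function names, the weight `alpha`, and the per-pixel-maximum fusion target are all hypothetical choices made for illustration; images are represented as flat lists of pixel intensities to keep the example dependency-free.

```python
# Hypothetical sketch of a generic distillation objective.
# NOT the paper's tailored loss; all names and terms are assumptions.

def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, source_a, source_b, alpha=0.5):
    """Blend a teacher-matching term with a simple fusion term.

    alpha is an assumed hyperparameter weighting the teacher-matching term.
    The fusion term pulls the student toward the per-pixel maximum of the
    two source images, used here only as a stand-in intensity target.
    """
    # Stand-in fusion target: brightest value from either source at each pixel.
    target = [max(a, b) for a, b in zip(source_a, source_b)]
    # How far the student is from the (text-guided) teacher's fused output.
    distill_term = mse(student_out, teacher_out)
    # How far the student is from the stand-in fusion target.
    fusion_term = mse(student_out, target)
    return alpha * distill_term + (1.0 - alpha) * fusion_term
```

In such a setup, the teacher output would be computed with text guidance during training only, while the student sees just the source images. At inference, only the student runs, which is what removes the text-processing cost.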
