MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion

Yanglin Deng, Tianyang Xu, Chunyang Cheng, Xiao-Jun Wu, Josef Kittler

TL;DR

MMDRFuse tackles the trade-off between fusion quality and computational cost in multi-modality image fusion by learning a 113-parameter mini-model (0.44 KB) under three forms of supervision: digestible distillation from a strong teacher, a comprehensive loss that combines pixel, gradient, and perceptual cues, and a dynamic refresh strategy that leverages historical training states. Together, these components enable end-to-end training of a tiny model that still performs robustly on infrared-visible and medical fusion tasks and supports pedestrian detection. Ablation studies confirm that each component contributes to performance, and cross-task evaluations show MMDRFuse achieving competitive or superior results with dramatically fewer parameters and a shorter runtime. The approach is practical for real-time MMIF applications and downstream processing, and the code is publicly available.

Abstract

In recent years, Multi-Modality Image Fusion (MMIF) has been applied in many fields, motivating considerable research into improving fusion performance. However, the prevailing focus has been on architecture design rather than on training strategies. As a low-level vision task, image fusion is expected to deliver output images quickly, both for observation and for supporting downstream tasks, so superfluous computational and storage overheads should be avoided. In this work, a lightweight Distilled Mini-Model with a Dynamic Refresh strategy (MMDRFuse) is proposed to achieve this objective. To pursue model parsimony, an extremely small convolutional network with a total of 113 trainable parameters (0.44 KB) is obtained through three carefully designed forms of supervision. First, digestible distillation is constructed by emphasising external spatial feature consistency, delivering soft supervision with balanced details and saliency for the target network. Second, we develop a comprehensive loss to balance the pixel, gradient, and perception cues from the source images. Third, an innovative dynamic refresh training strategy combines historical parameters with the current supervision during training, together with an adaptive adjustment function, to optimise the fusion network. Extensive experiments on several public datasets demonstrate that our method offers promising advantages in model efficiency and complexity, with superior performance in multiple image fusion tasks and in a downstream pedestrian detection application. The code of this work is publicly available at https://github.com/yanglinDeng/MMDRFuse.
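
To make the 113-parameter, 0.44 KB figure concrete, the short sketch below counts the parameters of one layer configuration that happens to reproduce both numbers. Only the parameter count, the size, and the fact that the student network has two convolutional layers (SConv1 and SConv2 in Figure 1) come from the text. The channel widths, the 3x3 kernels, the bias terms, and 32-bit float storage are assumptions chosen for illustration, and the actual network may be configured differently.

```python
def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Trainable parameters of a standard (non-grouped) 2-D convolution."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


# ASSUMPTION: two 3x3 convolutions with biases, 2 input channels (one per
# modality), 4 hidden channels and 1 output channel. These widths are an
# illustrative guess, not a configuration stated in the text.
layers = [(2, 4, 3), (4, 1, 3)]

total = sum(conv2d_params(i, o, k) for i, o, k in layers)
size_kb = total * 4 / 1024  # ASSUMPTION: 4 bytes (float32) per parameter

print(total, round(size_kb, 2))  # prints "113 0.44"
```

Under these assumptions the first layer contributes 2·4·9 + 4 = 76 parameters and the second 4·1·9 + 1 = 37, which sums to 113, or 452 bytes.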

Paper Structure (23 sections, 13 equations, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Illustration of our distillation process. TConv1, TConv2, ..., TConv13 represent the convolutional layers in the teacher network. SConv1 and SConv2 represent the convolutional layers in the student network. TOutput and SOutput denote their respective outputs.
  • Figure 2: Illustration of the feature maps used to reflect perception degrees. From left to right: the source image, the duplicated source image, and five feature maps extracted by VGG-19.
  • Figure 3: Workflow of the proposed dynamic refresh strategy. The green and orange databases denote the best historical fusion outputs measured by SSIM and GMSD, respectively. The centre shows the computation of $S_{bs}$ and $S_{cur}$, and of $G_{bg}$ and $G_{cur}$. The brain icons symbolise the computation of the two loss terms. The black dashed line denotes the use of $L_{refresh}$ to supervise the training process.
  • Figure 4: Visual comparison with SOTA approaches on MSRS.
  • Figure 5: Visual comparison with SOTA approaches on LLVIP.
  • ...and 2 more figures
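
The bookkeeping suggested by the Figure 3 caption, keeping the best historical fusion output under a metric and comparing its score with the current one, can be sketched as a small keep-the-best store. This is a hedged illustration rather than the authors' implementation: the class `BestHistory`, the function `refresh_penalty`, the flattened-list image type, and the form of the penalty are all hypothetical. The caption does not define the loss terms that make up $L_{refresh}$, and real SSIM or GMSD functions would be passed in as `score_fn`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Image = List[float]  # flattened image; a stand-in for a real tensor


@dataclass
class BestHistory:
    """Per-sample store of the best fused output seen so far under one metric."""
    score_fn: Callable[[Image, Image], float]  # (fused, source) -> score
    higher_is_better: bool  # True for an SSIM-like metric, False for a GMSD-like one
    outputs: Dict[int, Image] = field(default_factory=dict)
    scores: Dict[int, float] = field(default_factory=dict)

    def is_better(self, a: float, b: float) -> bool:
        return a > b if self.higher_is_better else a < b

    def refresh(self, idx: int, fused: Image, source: Image) -> Tuple[float, float]:
        """Score the current output, update the store if it improved, and
        return (current score, best score before this call)."""
        cur = self.score_fn(fused, source)
        prev_best = self.scores.get(idx, cur)
        if idx not in self.scores or self.is_better(cur, prev_best):
            self.outputs[idx] = list(fused)
            self.scores[idx] = cur
        return cur, prev_best


def refresh_penalty(fused: Image, history: BestHistory, idx: int,
                    cur: float, prev_best: float) -> float:
    """HYPOTHETICAL loss term: pull the current output towards the stored
    best output only when the history scored better than the current one."""
    if not history.is_better(prev_best, cur):
        return 0.0
    best = history.outputs[idx]
    return sum((f - b) ** 2 for f, b in zip(fused, best)) / len(fused)
```

In this sketch one store would be created per metric, for example one with `higher_is_better=True` for SSIM and one with `higher_is_better=False` for GMSD, mirroring the two databases in the figure. The two returned scores play the roles that the caption's $S_{cur}$ and $S_{bs}$ (or $G_{cur}$ and $G_{bg}$) appear to have.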