Dig2DIG: Dig into Diffusion Information Gains for Image Fusion
Bing Cao, Baoshuo Cai, Changqing Zhang, Qinghua Hu
TL;DR
This work addresses dynamic multi-modal guidance in diffusion-based image fusion, motivated by an observed spatio-temporal imbalance in denoising: different image regions gain information at different rates across denoising steps. It introduces diffusion information gains (DIG) to quantify each modality's contribution at every denoising step and develops Dig2DIG, a dynamic fusion framework that adjusts the guidance weights to reduce a generalization error upper bound of the form $\mathrm{GError}(F) \le C - \sum_{t=1}^T \sum_{k=1}^K \mathrm{Cov}(w_k, B)$. The authors prove that aligning the fusion weights with the corresponding guidance contributions lowers this bound, and they instantiate the dynamic weighting as $w_k(t) = \frac{\exp(\mathrm{DIG}_k(t))}{\sum_j \exp(\mathrm{DIG}_j(t))}$. Empirically, Dig2DIG outperforms existing diffusion-based methods in fusion quality and inference efficiency on visible-infrared, multi-focus, and multi-exposure tasks without additional training, consistent with the theoretical analysis.
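The softmax weighting in the TL;DR can be illustrated with a minimal NumPy sketch. It shows only the normalization step $w_k(t) = \exp(\mathrm{DIG}_k(t)) / \sum_j \exp(\mathrm{DIG}_j(t))$ for a single denoising step. How the DIG values themselves are computed is not specified in this summary, so the inputs below are placeholder numbers. The function name `dig_softmax_weights` and the convention of stacking modalities along axis 0 (optionally with trailing spatial dimensions for per-region weights) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def dig_softmax_weights(dig):
    """Softmax over the modality axis (axis 0).

    dig: array-like of shape (K,) or (K, H, W) holding DIG_k(t) for K
         modalities at one denoising step t (hypothetical layout).
    Returns weights of the same shape that sum to 1 over axis 0.
    """
    dig = np.asarray(dig, dtype=np.float64)
    # Subtracting the per-position maximum leaves the softmax unchanged
    # but avoids overflow in exp for large DIG values.
    shifted = dig - dig.max(axis=0, keepdims=True)
    exp_dig = np.exp(shifted)
    return exp_dig / exp_dig.sum(axis=0, keepdims=True)

# Placeholder DIG values for two modalities at one step (not real data).
weights = dig_softmax_weights([0.2, 1.0])

# Per-pixel variant: two modalities on a 2x2 grid of placeholder values.
pixel_weights = dig_softmax_weights(np.array([
    [[0.0, 1.0], [2.0, 0.5]],
    [[0.0, 0.0], [1.0, 0.5]],
]))
```

In this sketch the modality with the larger DIG receives the larger weight, and equal DIG values yield equal weights, which matches the intent of aligning fusion weights with each modality's contribution.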
Abstract
Image fusion integrates complementary information from multi-source images to generate more informative results. Recently, diffusion models, which demonstrate unprecedented generative potential, have been explored in image fusion. However, these approaches typically incorporate predefined multimodal guidance into diffusion, failing to capture the dynamically changing significance of each modality, and they lack theoretical guarantees. To address this issue, we reveal a significant spatio-temporal imbalance in image denoising; specifically, the diffusion model produces different information gains in different image regions across denoising steps. Based on this observation, we Dig into the Diffusion Information Gains (Dig2DIG) and theoretically derive a diffusion-based dynamic image fusion framework that provably reduces the upper bound of the generalization error. Accordingly, we introduce diffusion information gains (DIG) to quantify the information contribution of each modality at different denoising steps, thereby providing dynamic guidance during the fusion process. Extensive experiments on multiple fusion scenarios confirm that our method outperforms existing diffusion-based approaches in terms of both fusion quality and inference efficiency.
