Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond
Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, Risheng Liu
TL;DR
This work tackles infrared–visible image fusion (IVIF) that must balance high fusion quality with performance on downstream perceptual tasks. It introduces SAGE, a framework that injects SAM-derived semantic priors through a Semantic Persistent Attention (SPA) module and transfers this knowledge to a compact sub-network via a bi-level distillation scheme with triplet losses, enabling efficient inference without loading SAM at test time. The main components, namely the SPA module with its Persistent Repository and the bi-level distillation with the $\mathcal{L}_{\text{fea}}$, $\mathcal{L}_{\text{grad}}$, $\mathcal{L}_{\text{MSE}}$, and $\mathcal{L}_{\text{cs}}$ losses, improve fusion and segmentation across multiple datasets while keeping computation practical (e.g., $\sim$10 ms per image and a low parameter count). The findings demonstrate that SAM semantic priors can be leveraged to improve cross-modal fusion and downstream tasks without prohibitive runtime, thereby broadening the applicability of IVIF in real-world systems.
Abstract
Multi-modality image fusion, particularly of infrared and visible images, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remain challenging. Recent approaches attempt task-specific designs but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allows the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.
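To make the distillation idea concrete, the following is a minimal sketch of how a student output could be pulled toward a teacher output using a feature term, a gradient term, and a pixel-wise MSE term, loosely mirroring the $\mathcal{L}_{\text{fea}}$, $\mathcal{L}_{\text{grad}}$, and $\mathcal{L}_{\text{MSE}}$ names in the TL;DR. It is not the authors' implementation: the function names, the weights, the tensor shapes, and the exact form of each term are assumptions made for illustration. The $\mathcal{L}_{cs}$ term and the bi-level optimization are left out because they are not defined in this section.

```python
import numpy as np

def image_gradients(img):
    """Forward-difference gradients along the last two axes (height, width)."""
    dy = img[..., 1:, :] - img[..., :-1, :]
    dx = img[..., :, 1:] - img[..., :, :-1]
    return dy, dx

def distillation_loss(student_feat, teacher_feat, student_img, teacher_img,
                      w_fea=1.0, w_grad=1.0, w_mse=1.0):
    """Hypothetical teacher-to-student loss; the weights are illustrative assumptions."""
    # Feature term: mean squared distance between intermediate features.
    l_fea = np.mean((student_feat - teacher_feat) ** 2)

    # Gradient term: mean absolute difference between image gradients,
    # which encourages the student to keep the teacher's edges.
    s_dy, s_dx = image_gradients(student_img)
    t_dy, t_dx = image_gradients(teacher_img)
    l_grad = np.mean(np.abs(s_dy - t_dy)) + np.mean(np.abs(s_dx - t_dx))

    # Pixel term: mean squared error between the two fused images.
    l_mse = np.mean((student_img - teacher_img) ** 2)

    total = w_fea * l_fea + w_grad * l_grad + w_mse * l_mse
    return total, {"fea": l_fea, "grad": l_grad, "mse": l_mse}

# Toy data standing in for teacher (SAM-assisted) and student outputs.
rng = np.random.default_rng(0)
teacher_feat = rng.standard_normal((1, 8, 16, 16))
teacher_img = rng.random((1, 1, 64, 64))
student_feat = teacher_feat + 0.1 * rng.standard_normal(teacher_feat.shape)
student_img = teacher_img + 0.05 * rng.standard_normal(teacher_img.shape)

total, parts = distillation_loss(student_feat, teacher_feat, student_img, teacher_img)
print(f"total={total:.4f}", {k: round(float(v), 4) for k, v in parts.items()})
```

In this sketch only the student would be updated against the loss, so the teacher branch (and SAM with it) is needed during training but not at inference, which is the practical point the abstract makes.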
