Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond

Guanyao Wu, Haoyu Liu, Hongming Fu, Yichuan Peng, Jinyuan Liu, Xin Fan, Risheng Liu

TL;DR

This work tackles infrared–visible image fusion (IVIF) that must balance high fusion quality with downstream perceptual tasks. It introduces SAGE, a framework that injects SAM-derived semantic priors through a Semantic Persistent Attention (SPA) module and transfers this knowledge to a compact sub-network via a bi-level distillation scheme with triplet losses, enabling efficient inference without loading SAM at test time. The main components, namely the SPA module with a Persistent Repository and the bi-level distillation with the $\mathcal{L}_{\text{fea}}$, $\mathcal{L}_{\text{grad}}$, $\mathcal{L}_{\text{MSE}}$, and $\mathcal{L}_{\text{cs}}$ losses, improve fusion and segmentation across multiple datasets while keeping computation practical (e.g., $\sim$10 ms per image and a low parameter count). The findings demonstrate that SAM semantic priors can be leveraged to improve cross-modal fusion and downstream tasks without prohibitive runtime, thereby broadening the applicability of IVIF in real-world systems.

Abstract

Multi-modality image fusion, particularly of infrared and visible images, plays a crucial role in integrating diverse modalities to enhance scene understanding. Although early research prioritized visual quality, preserving fine details and adapting to downstream tasks remain challenging. Recent approaches attempt task-specific designs but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to Grow the quality of fusion results and Enable downstream task adaptability, namely SAGE. Specifically, we design a Semantic Persistent Attention (SPA) Module that efficiently maintains source information via the persistent repository while extracting high-level semantic priors from SAM. More importantly, to eliminate the impractical dependence on SAM during inference, we introduce a bi-level optimization-driven distillation mechanism with triplet losses, which allows the student network to effectively extract knowledge. Extensive experiments show that our method achieves a balance between high-quality visual results and downstream task adaptability while maintaining practical deployment efficiency. The code is available at https://github.com/RollingPlain/SAGE_IVIF.

Paper Structure

This paper contains 13 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Differences between the proposed method and existing mainstream approaches: (a) Traditional and early DL-based methods focus on the visual quality of the fused image. (b) Task-specific methods (e.g., TarDAL liu2022target and SegMiF liu2023multi) introduce task losses and features that lead to inconsistent optimization goals, causing a conflict between visual quality and task accuracy. (c) Our pipeline first leverages semantic priors from SAM within a large network and then distills the knowledge into a smaller sub-network, achieving practical inference feasibility while ensuring "the best of both worlds" through SAM's inherent adaptability to these tasks.
  • Figure 2: Demonstration of SAM's robustness in MFNet normal scenes (top) and under challenging FMB conditions (bottom).
  • Figure 3: An overall workflow of our proposed method. (a) shows the flow structure of the main network, where the SPA module processes patches with semantic priors generated by SAM. (b) illustrates the detailed structure of the SPA module, where the Persistent Repository (PR) plays a key role in preserving the source information and integrating the semantic information. (c) displays the formulation of our distillation scheme, with visualizations of the different components of the triplet loss. (d) provides a simple diagram of the sub-network, which is composed of stacked dense blocks.
  • Figure 4: Qualitative demonstrations of SOTA approaches across commonly used datasets, including TNO, RoadScene, M$^3$FD and FMB.
  • Figure 5: Quantitative comparison of fusion performance with other SOTA methods on the M$^3$FD and FMB benchmarks. The violin plots illustrate the distribution of metrics, in which the black lines and white triangles indicate the median and mean values, respectively.
  • ...and 5 more figures