RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation

Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang

TL;DR

This work reframes text-driven infrared–visible image fusion from the perspective of referring image segmentation (RIS), using RIS as a goal-aligned task to supervise and evaluate language-conditioned fusion. It introduces RIS-FUSION, a cascaded framework whose LangGatedFusion module injects text features into both the fusion and RIS stages, and which optimizes fusion jointly with the RIS loss. To support evaluation, the MM-RIS benchmark provides 12.5k training and 3.5k testing triplets of IR–VIS image pairs, segmentation masks, and referring expressions. Empirically, RIS-FUSION achieves state-of-the-art RIS performance, outperforming existing methods by over 11% in mIoU, while delivering clearer fused images. The code and dataset are to be released publicly.

Abstract

Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
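
For readers unfamiliar with the mIoU metric cited above, the following is a minimal sketch of one common way it is computed in referring segmentation work: the intersection-over-union of each predicted binary mask against its ground-truth mask, averaged over all samples. The abstract does not spell out the exact protocol, so the per-sample averaging, the function names `iou` and `mean_iou`, and the convention of scoring an empty prediction against an empty target as 1.0 are all assumptions made for illustration.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over a set of (prediction, target) pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    print(mean_iou(preds, targets))
```

In the usage example, the first pair overlaps on one of two foreground pixels and the second pair matches exactly, so the two per-sample scores are 0.5 and 1.0. Some papers instead report an overall IoU that pools intersections and unions across the whole test set, which can give a different number from the per-sample mean shown here.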

Paper Structure

This paper contains 12 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The overall architecture of the proposed RIS-FUSION framework.
  • Figure 2: Qualitative comparison of the multimodal referring image segmentation task. The referring text is "two color cones on the left side of the road".