RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation

Kunyu Tan, Mingjian Liang

TL;DR

RTFDNet is a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Experiments show consistent performance across varying modality conditions.

Abstract

RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically treat modality fusion and modality adaptation as separate steps, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our source code is publicly available at https://github.com/curapima/RTFDNet to facilitate further research.
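
The abstract describes RDR only at a high level, so the following is a minimal sketch of the general idea rather than the authors' formulation. It assumes a cross-entropy consistency term, a hypothetical confidence threshold, and an optional hypothetical set of selected classes; the function names are illustrative. The fused probabilities are plain numbers here, so they act as fixed targets. In an autodiff framework the same role would be played by a stop-gradient on the fused branch (for example `jax.lax.stop_gradient` or `Tensor.detach()` in PyTorch).

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confident_consistency_loss(fused_logits, unimodal_logits,
                               threshold=0.9, classes=None):
    """Mean cross-entropy of unimodal predictions against fused pseudo-labels.

    Only pixels where the fused prediction is confident (max probability
    >= threshold) contribute. If `classes` is given, only pixels whose
    fused pseudo-label is in that set contribute (assumed reading of
    "class-selective"). Inputs are lists of per-pixel logit lists.
    """
    total, count = 0.0, 0
    for f, u in zip(fused_logits, unimodal_logits):
        p_fused = softmax(f)          # constant target: no gradient to fusion
        conf = max(p_fused)
        label = p_fused.index(conf)   # fused pseudo-label for this pixel
        if conf < threshold:
            continue                  # skip unconfident regions
        if classes is not None and label not in classes:
            continue                  # skip classes that are not selected
        p_uni = softmax(u)
        total += -math.log(p_uni[label])
        count += 1
    return total / count if count else 0.0

# Three pixels, three classes. The middle pixel is nearly uniform in the
# fused branch, so it is excluded from the loss.
fused = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 5.0, 0.0]]
rgb = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 2.0]]
loss_all = confident_consistency_loss(fused, rgb)
loss_class0 = confident_consistency_loss(fused, rgb, classes={0})
```

In this toy input the RGB branch agrees with the fused branch on the first pixel and disagrees on the third, so restricting the loss to class 0 yields a smaller value than averaging over both confident pixels.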

Paper Structure (18 sections, 11 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Architecture comparison of training paradigms for RGB–Thermal segmentation under missing modalities. (A) Two-stage knowledge distillation trains a multimodal teacher, then distills it into per-case students. (B) Modal-adaptation fine-tuning freezes the base model and updates lightweight adapters to accommodate a dropped modality. (C) Our unified robustness training jointly optimizes RGB/Thermal encoders, a fusion branch, and multi-branch decoders with bidirectional consistency. At inference, only the corresponding encoder and decoder parameters need to be loaded.
  • Figure 2: Diagram of the encoder–decoder model, in which SFF performs modality fusion and CMDR/RDR decouple the fused features to regularize and guide each single-modality branch. (The robotic platform is a conceptual illustration.)
  • Figure 3: Visualizations of segmentation and feature maps generated by the RGB and Thermal decoders for specific regions.
  • Figure 4: Qualitative results on the MFNet dataset.
  • Figure 5: Qualitative results on the FMB dataset.
  • ...and 1 more figure
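
The Figure 1 caption notes that inference only requires the encoder and decoder parameters of the corresponding branch. A minimal sketch of how that selection might look is given below, assuming a flat checkpoint dictionary whose keys carry hypothetical prefixes such as `rgb_encoder.` and `thermal_decoder.`. These key names are illustrative and are not taken from the released code.

```python
def select_branch_params(state_dict, branch):
    """Keep only entries belonging to one branch's encoder and decoder.

    `branch` is a hypothetical branch name ("rgb", "thermal", or "fusion");
    keys are assumed to start with "<branch>_encoder." or "<branch>_decoder.".
    """
    prefixes = (f"{branch}_encoder.", f"{branch}_decoder.")
    return {k: v for k, v in state_dict.items() if k.startswith(prefixes)}

# Toy checkpoint with placeholder values standing in for weight tensors.
checkpoint = {
    "rgb_encoder.conv1.weight": 0,
    "thermal_encoder.conv1.weight": 1,
    "fusion_encoder.conv1.weight": 2,
    "rgb_decoder.head.weight": 3,
    "thermal_decoder.head.weight": 4,
    "fusion_decoder.head.weight": 5,
}

# Thermal-only deployment: the RGB and fusion entries are dropped.
thermal_only = select_branch_params(checkpoint, "thermal")
```

With this layout, a thermal-only deployment would hold two of the six entries in memory, which illustrates the kind of saving the caption describes.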