Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

TL;DR

This work addresses robust road-scene segmentation under challenging illumination for RGB-T data. It introduces CLARITY, a language-guided dynamic fusion framework that uses CLIP-based illumination cues $P_{cond}$ and scene embeddings $P_{emb}$ to gate a sparse Mixture-of-Experts fusion and to guide segmentation with semantic priors. A Soft-Gated Unbalanced Point Transformer and a Self-Calibrated Decoder enforce detail preservation and multi-scale consistency, while an edge-aware loss mitigates class imbalance. On MFNet, CLARITY achieves a new state-of-the-art of $62.3\%$ mIoU and $77.5\%$ mAcc, outperforming static fusion baselines and recent Transformers, demonstrating practical gains for autonomous driving under adverse lighting.

Abstract

Robust semantic segmentation of road scenes under adverse illumination and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. We therefore propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

Paper Structure (10 sections, 9 equations, 3 figures, 4 tables)

This paper contains 10 sections, 9 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: A VLM generates a template text caption for each scene, which is encoded into a semantic object embedding (words in violet) regardless of illumination conditions. (a) Under poor illumination, the RGB image suffers from motion blur that obscures the pedestrian, while the thermal image remains clear. The VLM identifies this as a "Total Darkness" condition (words in yellow), activating kernels (convolutional filters) that prioritize thermal data to avoid integrating RGB noise. (b) Conversely, in a "Well-lit" driving scene, the RGB image provides clear details and edges for objects such as the person and guardrail, so the VLM gates the model to prioritize RGB over thermal input.
  • Figure 2: Architecture of the proposed CLARITY method. The framework begins with a Semantic Condition Generator, where a VLM (CLIP) analyzes the scene to generate a caption $T_c$. This process yields two key embeddings: an illumination condition embedding ($P_{cond}$), derived from $T_i$ for the illumination condition ($\langle \text{condition} \rangle$), and a holistic scene embedding ($P_{emb}$), derived from $T_c$ for the detected objects ($\langle O_{detected} \rangle$) and the illumination condition. $P_{cond}$ serves as a gating signal for the Sparse MoE, dynamically routing inputs to degradation-specific Top-K Kernel Experts for adaptive fusion. The fused features are then enhanced by a Soft-Gated UPT to recover faint thermal details. Finally, a Self-Calibrated Decoder with Dilated Feature Aggregation (DFA) blocks captures multi-scale context before feeding the Segmentation and Edge Heads.
  • Figure 3: Segmentation results produced by the proposed CLARITY and state-of-the-art methods on the MFNet dataset [10234530].
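
The condition-gated sparse fusion described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the names `topk_gate`, `moe_fuse`, and `w_gate` are hypothetical, the gate is assumed to be a single linear layer over $P_{cond}$, and each "expert" is reduced to a scalar RGB/thermal mixing ratio rather than the convolutional kernel experts used in the paper. It only shows the routing idea: score all experts from the condition embedding, keep the top K, renormalize their weights, and combine their outputs.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def topk_gate(p_cond, w_gate, k):
    """Sparse gate: softmax over the k largest logits, zero for all other experts.

    p_cond : (d,) condition embedding (stand-in for P_cond).
    w_gate : (num_experts, d) hypothetical linear gating weights.
    """
    logits = w_gate @ p_cond                # one score per expert
    top = np.argsort(logits)[-k:]           # indices of the k highest scores
    weights = np.zeros_like(logits)
    weights[top] = softmax(logits[top])     # renormalize over selected experts only
    return weights, top


def moe_fuse(rgb, thermal, p_cond, w_gate, experts, k=2):
    """Fuse RGB and thermal features with the k experts selected by the gate.

    experts : list of scalars in [0, 1]; expert i outputs a*rgb + (1-a)*thermal.
              This is an assumed toy expert, not a kernel expert from the paper.
    """
    weights, top = topk_gate(p_cond, w_gate, k)
    fused = np.zeros_like(rgb)
    for i in top:                           # only the selected experts are evaluated
        a = experts[i]
        fused += weights[i] * (a * rgb + (1.0 - a) * thermal)
    return fused, weights


# Toy inputs: 8-channel 4x4 feature maps and a 16-dimensional condition embedding.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(8, 4, 4))
thermal = rng.normal(size=(8, 4, 4))
p_cond = rng.normal(size=16)
w_gate = rng.normal(size=(4, 16))
experts = [0.9, 0.5, 0.1, 0.0]              # from RGB-heavy to thermal-only

fused, weights = moe_fuse(rgb, thermal, p_cond, w_gate, experts, k=2)
```

Because the weights of the unselected experts are exactly zero, only K experts contribute to each fused feature map, which is what makes the mixture sparse. In a trained model the gate would be learned so that, for example, a "Total Darkness" embedding routes to thermal-heavy experts; here the gate weights are random and serve only to exercise the routing.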