Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model

Alaa Dalaq, Muzammil Behzad

TL;DR

Referring image segmentation requires precise cross-modal grounding between language and complex object boundaries. SegVLM introduces deformable-attentive visual enhancement, SE recalibration, residual fusion, and a Referring-Aware Fusion (RAF) loss to improve cross-modal alignment and boundary accuracy within a vision-language framework. The RAF loss combines BCE, Focal, and Adaptive Dice components to address class imbalance and hard boundary pixels, while deformable convolutions enable adaptive receptive fields and language-conditioned dynamic kernels. Evaluated on PhraseCut, SegVLM achieves an IoU of about $53.9\%$ with strong boundary precision while remaining efficient at ~63M parameters, and it generalizes to synthetic prompts and cross-domain datasets.
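As an illustration of how a loss of this kind can be assembled, the following NumPy sketch sums a BCE term, a focal term, and a Dice term over a predicted probability mask. It is a sketch under assumptions, not the paper's formulation: the exact form of the Adaptive Dice component and the component weights are not given in this summary, so a plain soft Dice term stands in for it, and `combined_loss`, `w_bce`, `w_focal`, `w_dice`, `gamma`, and `alpha` are hypothetical names and values.

```python
import numpy as np


def bce_loss(p, y, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and binary mask y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss: down-weights easy pixels so hard pixels dominate the mean."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def soft_dice_loss(p, y, smooth=1.0):
    """Plain soft Dice loss (assumed stand-in for the Adaptive Dice term)."""
    intersection = np.sum(p * y)
    return float(1.0 - (2.0 * intersection + smooth) / (np.sum(p) + np.sum(y) + smooth))


def combined_loss(p, y, w_bce=1.0, w_focal=1.0, w_dice=1.0):
    """Weighted sum of the three terms; the weights here are hypothetical."""
    return (w_bce * bce_loss(p, y)
            + w_focal * focal_loss(p, y)
            + w_dice * soft_dice_loss(p, y))


# Toy 4x4 mask with a small foreground region (class imbalance).
target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0
good_pred = np.where(target == 1, 0.9, 0.1)
bad_pred = np.where(target == 1, 0.1, 0.9)
print(combined_loss(good_pred, target) < combined_loss(bad_pred, target))
```

The BCE term scores every pixel equally, the focal term shifts weight toward misclassified pixels, and the Dice term is computed over the whole region, so it is less sensitive to how few foreground pixels there are.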

Abstract

Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.
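To make the squeeze-and-excitation recalibration and the residual connection concrete, here is a minimal NumPy sketch of a standard SE gate wrapped in a skip connection. It assumes the usual SE design (global average pooling, a two-layer bottleneck, sigmoid gates); where exactly SegVLM places this block, its reduction ratio, and its deformable convolution are not specified in this summary, so the deformable part is omitted and `se_recalibrate`, `residual_se`, `w1`, and `w2` are hypothetical names.

```python
import numpy as np


def se_recalibrate(x, w1, w2):
    """Scale each channel of x by a learned gate in (0, 1).

    x:  feature map of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    """
    z = x.mean(axis=(1, 2))                   # squeeze: one scalar per channel
    h = np.maximum(w1 @ z, 0.0)               # excitation, step 1: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))       # excitation, step 2: sigmoid gates
    return x * s[:, None, None]               # broadcast gates over H and W


def residual_se(x, w1, w2):
    """Add the recalibrated features back onto the input (skip connection)."""
    return x + se_recalibrate(x, w1, w2)


rng = np.random.default_rng(0)
channels, reduction = 8, 4
features = rng.normal(size=(channels, 5, 5))
w1 = rng.normal(size=(channels // reduction, channels))
w2 = rng.normal(size=(channels, channels // reduction))
out = residual_se(features, w1, w2)
print(out.shape)
```

Because the gates lie strictly between 0 and 1, the block can only attenuate channels relative to one another; the skip connection keeps the original signal available to later layers.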

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of SegVLM Vision-Language Segmentation. Given an input image and a referring expression (e.g., “Black baseball cap”), SegVLM outputs a segmentation mask for the referred region. The model integrates the proposed components (RAF Loss, Deformable Convolution, Residual Connections, and SE Attention) to refine segmentation performance.
  • Figure 2: The SegVLM architecture for referring image segmentation comprises an image encoder and a text encoder, whose outputs are fused by a vision-language decoder operating on visual and textual tokens with positional encoding. The fused cross-modal features are then refined using a Text & Image Projector, which aligns both modalities within a shared embedding space. To further enhance the visual feature representation, a Residual Block is incorporated, augmented with Squeeze-and-Excitation (SE) blocks and Deformable Convolutions. These additions enable adaptive channel recalibration and flexible spatial context modeling, collectively contributing to improved object localization and robust cross-modal grounding for complex referring expressions.
  • Figure 3: Ablation study showing the impact of each proposed enhancement on IoU, Prec@50, and Prec@90. Each component contributes incrementally to the final performance of SegVLM, with consistent improvements across all metrics, especially at stricter thresholds (Prec@90).
  • Figure 4: Precision across IoU thresholds (50–90). SegVLM consistently achieves higher precision at all thresholds, with particularly notable gains at stricter levels (Prec@80 and Prec@90), indicating improved boundary alignment and overall robustness.
  • Figure 5: Incremental improvements in IoU and precision scores (P@50–P@90) at each stage of model enhancement. Each bar illustrates the relative performance gain introduced by a specific component (RAF loss, deformable convolution, residual connections, and SE block), reaching the final SegVLM performance.
  • ...and 3 more figures