Deformable Attentive Visual Enhancement for Referring Segmentation Using Vision-Language Model
Alaa Dalaq, Muzammil Behzad
TL;DR
Referring image segmentation requires precise cross-modal grounding between language and complex object boundaries. SegVLM introduces deformable-attentive visual enhancement, SE recalibration, residual fusion, and a Referring-Aware Fusion (RAF) loss to improve cross-modal alignment and boundary accuracy within a vision-language framework. The RAF loss combines BCE, Focal, and Adaptive Dice components to address class imbalance and hard boundary pixels, while deformable convolutions enable adaptive receptive fields and language-conditioned dynamic kernels. Evaluated on PhraseCut, SegVLM achieves $IoU \approx 53.9\%$ with strong boundary precision while remaining efficient at roughly 63M parameters, and it generalizes robustly to synthetic prompts and cross-domain datasets, supporting practical deployment.
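As a rough illustration of how a loss that combines BCE, Focal, and Dice terms can be assembled, the following minimal sketch operates on flat lists of per-pixel foreground probabilities and binary labels. It is not the paper's RAF loss: the weights `w_bce`, `w_focal`, `w_dice`, the focal parameters `gamma` and `alpha`, and the smoothing constant are hypothetical defaults, and a plain soft Dice term stands in for the Adaptive Dice component, whose exact form is not given in this section.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one pixel (p = predicted foreground probability)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel: down-weights pixels that are already easy."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_dice_loss(probs, targets, smooth=1.0):
    """Plain soft Dice loss over a mask (stand-in for Adaptive Dice)."""
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def combined_loss(probs, targets, w_bce=1.0, w_focal=1.0, w_dice=1.0):
    """Weighted sum of mean BCE, mean focal, and soft Dice (weights are assumptions)."""
    n = len(probs)
    l_bce = sum(bce(p, y) for p, y in zip(probs, targets)) / n
    l_focal = sum(focal(p, y) for p, y in zip(probs, targets)) / n
    l_dice = soft_dice_loss(probs, targets)
    return w_bce * l_bce + w_focal * l_focal + w_dice * l_dice


if __name__ == "__main__":
    targets = [1, 0, 1, 0]
    print(combined_loss([0.9, 0.2, 0.8, 0.1], targets))
```

In this sketch the BCE term supplies a per-pixel signal, the focal term concentrates the gradient on hard pixels such as those near boundaries, and the Dice term is a region-overlap measure that is less sensitive to the foreground/background imbalance.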
Abstract
Image segmentation is a fundamental task in computer vision, aimed at partitioning an image into semantically meaningful regions. Referring image segmentation extends this task by using natural language expressions to localize specific objects, requiring effective integration of visual and linguistic information. In this work, we propose SegVLM, a vision-language model that incorporates architectural improvements to enhance segmentation accuracy and cross-modal alignment. The model integrates squeeze-and-excitation (SE) blocks for dynamic feature recalibration, deformable convolutions for geometric adaptability, and residual connections for deep feature learning. We also introduce a novel referring-aware fusion (RAF) loss that balances region-level alignment, boundary precision, and class imbalance. Extensive experiments and ablation studies demonstrate that each component contributes to consistent performance improvements. SegVLM also shows strong generalization across diverse datasets and referring expression scenarios.
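To make the idea of SE-based feature recalibration concrete, here is a minimal sketch of a generic squeeze-and-excitation step in plain Python. It follows the standard SE recipe (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then channel-wise rescaling); the function name `se_recalibrate`, the weight matrices `w1` and `w2`, and the omission of bias terms are illustrative assumptions, not details of SegVLM's implementation.

```python
import math


def se_recalibrate(feature_map, w1, w2):
    """Generic squeeze-and-excitation over a C x H x W feature map.

    feature_map: list of C channels, each a list of H rows of W floats.
    w1: r x C reduction weights; w2: C x r expansion weights (hypothetical,
    bias-free for brevity).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_map, gates)
    ]
    return scaled, gates


if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
    w1 = [[0.5, -0.5]]      # reduce 2 channels to 1 hidden unit
    w2 = [[1.0], [-1.0]]    # expand back to 2 channel gates
    scaled, gates = se_recalibrate(fmap, w1, w2)
    print(gates)
```

Each gate lies in (0, 1), so the block can only attenuate channels relative to one another; in a full model the weights would be learned jointly with the rest of the network.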
