Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation
Authors
Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad, Abdul Bais
Abstract
Vision-Language Models (VLMs) provide rich semantic priors but remain underexplored in semi-supervised semantic segmentation. Recent attempts that integrate VLMs to inject high-level semantics overlook the semantic misalignment between visual and textual representations that arises from using domain-invariant text embeddings without adapting them to dataset- and image-specific contexts. This lack of domain awareness, coupled with limited annotations, weakens the model's semantic understanding by preventing effective vision-language alignment. As a result, the model struggles with contextual reasoning, shows weak intra-class discrimination, and confuses similar classes. To address these challenges, we propose the Hierarchical Vision Language transFormer (HVLFormer), which achieves domain-aware and domain-robust alignment between visual and textual representations within a mask-transformer architecture. First, we transform text embeddings from pretrained VLMs into textual object queries, enabling the generation of multi-scale, dataset-aware queries that capture class semantics from coarse to fine granularity and enhance contextual reasoning. Next, we refine these queries by injecting image-specific visual context to align textual semantics with local scene structures and enhance class discrimination. Finally, to achieve domain robustness, we introduce cross-view and cross-modal consistency regularization, which enforces prediction consistency within the mask-transformer architecture across augmented views and ensures stable vision-language alignment during decoding. With less than 1% of the training data labeled, HVLFormer outperforms state-of-the-art methods on Pascal VOC, COCO, ADE20K, and Cityscapes. Our code and results will be available on GitHub.
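To make the first two steps of the abstract concrete, the sketch below illustrates one plausible way to (a) project frozen VLM text embeddings into multi-scale textual object queries and (b) refine them with image-specific visual context via cross-attention. This is a minimal PyTorch illustration under our own assumptions, not the authors' implementation: module names such as TextualQueryProjector and VisualContextRefiner, and all dimensions, are hypothetical, and the cross-view/cross-modal consistency regularization is omitted.

```python
# Minimal sketch (assumed design, not HVLFormer's actual code): turning frozen
# CLIP-style text embeddings into multi-scale object queries for a
# mask-transformer decoder, then refining them with image-specific context.
import torch
import torch.nn as nn


class TextualQueryProjector(nn.Module):
    """Project per-class text embeddings into dataset-aware object queries,
    one set per decoder scale (coarse to fine)."""

    def __init__(self, text_dim=512, query_dim=256, num_scales=3):
        super().__init__()
        # One lightweight projection per scale so each level can specialise.
        self.scale_proj = nn.ModuleList([
            nn.Sequential(nn.Linear(text_dim, query_dim), nn.LayerNorm(query_dim))
            for _ in range(num_scales)
        ])

    def forward(self, text_emb):
        # text_emb: (num_classes, text_dim) frozen embeddings from a VLM text encoder.
        return [proj(text_emb) for proj in self.scale_proj]  # each (num_classes, query_dim)


class VisualContextRefiner(nn.Module):
    """Inject image-specific visual context into textual queries via
    cross-attention, aligning textual semantics with local scene structure."""

    def __init__(self, query_dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(query_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(query_dim)

    def forward(self, queries, visual_tokens):
        # queries: (B, num_classes, query_dim); visual_tokens: (B, HW, query_dim)
        attended, _ = self.cross_attn(queries, visual_tokens, visual_tokens)
        return self.norm(queries + attended)


if __name__ == "__main__":
    B, C, HW = 2, 21, 1024                    # batch, classes (e.g. Pascal VOC), flattened features
    text_emb = torch.randn(C, 512)            # stand-in for frozen CLIP text embeddings
    visual_tokens = torch.randn(B, HW, 256)   # stand-in for image encoder features

    projector = TextualQueryProjector()
    refiner = VisualContextRefiner()

    multi_scale_queries = projector(text_emb)
    # Refine the finest-scale queries with image-specific visual context.
    refined = refiner(multi_scale_queries[-1].unsqueeze(0).expand(B, -1, -1), visual_tokens)
    print(refined.shape)  # torch.Size([2, 21, 256])
```

In a full pipeline, the refined queries would replace the learnable object queries of a Mask2Former-style decoder, and the consistency regularization described in the abstract would constrain predictions across augmented views.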