Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak, Byeongju Woo, Sunghwan Kim, Dae-hwan Kim, Hoseong Kim
TL;DR
This work tackles Domain Generalized Semantic Segmentation (DGSS) by using the domain-invariant semantics encoded in Vision-Language Model (VLM) text embeddings as textual object queries within a mask-transformer framework. The proposed tqdm framework generates and refines textual queries, integrates text-to-pixel attention to improve pixel-level semantic clarity, and employs three regularization losses to preserve robust vision-language alignment, achieving state-of-the-art results (e.g., 68.9 mIoU on GTA5→Cityscapes). tqdm also generalizes to extreme domain shifts such as sketches, which highlights the practical potential of language-driven DGSS. By grounding dense predictions in domain-invariant textual semantics, the approach suggests a principled path toward open-vocabulary, domain-robust segmentation.
Abstract
In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from the text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries serve as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we propose three regularization losses that improve the efficacy of tqdm by aligning visual and textual features. With our method, the model can comprehend the inherent semantic information of the classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5$\rightarrow$Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
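
As a rough illustration of what it means to use text embeddings as object queries for pixel grouping, the sketch below assigns each pixel to the class whose query is most similar to that pixel's feature. It is a toy under stated assumptions, not the tqdm architecture: the hand-written 3-dimensional vectors stand in for VLM text embeddings and dense visual features, the function names are hypothetical, and the query refinement, text-to-pixel attention, transformer decoder, and regularization losses described above are all omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def segment_with_textual_queries(text_queries, pixel_features):
    """Label each pixel with the index of its most similar textual query.

    text_queries:   one embedding per class (assumed to come from a VLM text encoder).
    pixel_features: H x W grid of per-pixel embeddings in the same space.
    """
    queries = [l2_normalize(q) for q in text_queries]
    labels = []
    for row in pixel_features:
        label_row = []
        for feat in row:
            f = l2_normalize(feat)
            # Cosine similarity between the pixel and every class query.
            scores = [dot(q, f) for q in queries]
            label_row.append(max(range(len(scores)), key=scores.__getitem__))
        labels.append(label_row)
    return labels


# Hypothetical toy data: two class queries and a 2 x 2 feature map.
text_queries = [
    [1.0, 0.0, 0.0],  # stand-in for the embedding of "road"
    [0.0, 1.0, 0.0],  # stand-in for the embedding of "car"
]
pixel_features = [
    [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]],
    [[0.7, 0.3, 0.0], [0.1, 0.9, 0.0]],
]

labels = segment_with_textual_queries(text_queries, pixel_features)
print(labels)
```

Because the queries are derived from class names rather than from images of a particular style, the same queries can in principle be reused when the visual domain changes; only the pixel features differ.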
