Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Byeonghyun Pak, Byeongju Woo, Sunghwan Kim, Dae-hwan Kim, Hoseong Kim

TL;DR

This work tackles Domain Generalized Semantic Segmentation by leveraging domain-invariant semantics encoded in Vision-Language Model (VLM) text embeddings as textual object queries within a mask-transformer framework. The proposed tqdm framework generates and refines textual queries, integrates text-to-pixel attention to improve pixel-level semantic clarity, and employs three regularization losses to preserve robust vision-language alignment, achieving state-of-the-art results (e.g., $68.9$ mIoU on GTA5→Cityscapes). By demonstrating strong generalization to extreme domain shifts such as sketches, tqdm highlights the practical potential of language-driven DGSS. The approach offers a principled path to open-vocabulary, domain-robust segmentation by grounding dense predictions in domain-invariant textual semantics.

Abstract

In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we propose three regularization losses that improve the efficacy of tqdm by aligning visual and textual features. With our method, the model can comprehend inherent semantic information for classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5$\rightarrow$Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU. The project page is available at https://byeonghyunpak.github.io/tqdm.
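The idea of using class text embeddings as object queries can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `predict_with_textual_queries`, the one-hot stand-ins for VLM text embeddings, and the synthetic pixel embeddings are all assumptions made for illustration, and the query refinement performed by the transformer decoder is omitted. The sketch only shows how one query per class can group pixels through a similarity score against per-pixel embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def predict_with_textual_queries(text_embeddings, pixel_embeddings):
    """Group pixels by their similarity to one query per class.

    text_embeddings:  (K, C) array, one (hypothetical) text embedding per class.
    pixel_embeddings: (H, W, C) array of per-pixel embeddings.
    Returns mask logits of shape (K, H, W) and a label map of shape (H, W).
    """
    queries = l2_normalize(text_embeddings)
    pixels = l2_normalize(pixel_embeddings)
    # Cosine similarity between every class query and every pixel.
    mask_logits = np.einsum("kc,hwc->khw", queries, pixels)
    # Each pixel is assigned to the class whose query it matches best.
    label_map = mask_logits.argmax(axis=0)
    return mask_logits, label_map


# Toy data (assumption): one-hot rows stand in for real VLM text embeddings.
K, C, H, W = 3, 8, 4, 4
text_embeddings = np.eye(K, C)
true_labels = np.arange(H * W).reshape(H, W) % K
rng = np.random.default_rng(0)
# Each pixel embedding is a slightly noisy copy of its class embedding.
pixel_embeddings = text_embeddings[true_labels] + 0.05 * rng.normal(size=(H, W, C))

mask_logits, label_map = predict_with_textual_queries(text_embeddings, pixel_embeddings)
```

Because the queries come from text rather than from the training images, the same query set can in principle be reused on images from a different visual domain, which is the property the abstract relies on.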

Paper Structure

This paper contains 19 sections, 9 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: (a) A collection of driving scene images with diverse styles generated by ChatGPT. (b) The image-text similarity maps of a pre-trained VLM (i.e., EVA02-CLIP [sun2023eva]) on diverse domains. The text embedding of 'car' is consistently well-aligned with the corresponding class regions of images across various domains. (c) The segmentation results predicted by our proposed tqdm. Note that our model can generalize to extreme domains (e.g., sketch style) and effectively identify the cars in multiple forms that are not present in the source domain (i.e., GTA5 [richter2016playing]).
  • Figure 2: Effectiveness of textual object query. (a) In all DGSS benchmarks, textual object queries ($\textbf{q}_\text{text}$) outperform randomly initialized object queries ($\textbf{q}_\text{rand}$). (b) We visualize the mask predictions corresponding to a class (i.e., 'bicycle'), derived from $\textbf{q}_\text{rand}$ on the left and $\textbf{q}_\text{text}$ on the right, respectively. $\textbf{q}_\text{rand}$ yields a degraded result, whereas $\textbf{q}_\text{text}$ produces a robust one on an unseen domain.
  • Figure 3: Overall pipeline of tqdm. (Step 1) We generate initial textual object queries $\textbf{q}^0_\textbf{t}$ from the $K$ class text embeddings $\{\textbf{t}_k\}^K_{k=1}$. (Step 2) To improve the segmentation capabilities of these queries, we incorporate text-to-pixel attention within the pixel decoder. This process enhances the semantic clarity of pixel features, while reconstructing high-resolution per-pixel embeddings $\textbf{Z}$. (Step 3) The transformer decoder refines these queries for the final prediction. Each prediction output is then assigned to its corresponding ground truth (GT) through fixed matching, ensuring that each query consistently represents the semantic information of one class.
  • Figure 4: Three regularization losses to enhance the efficacy of tqdm. (a) Language regularization prevents the learnable prompts from distorting the semantic meaning of text embeddings. (b) Vision-language regularization aims to align visual and textual features at the pixel-level. (c) Vision regularization maintains the ability of the vision encoder to align with textual information at the image-level.
  • Figure 5: Precision-recall curves of region proposals for the rarest classes and class-wise IoU results. (a) For the rarest classes (i.e., 'train,' 'motorcycle,' 'rider,' and 'bicycle'), our tqdm with textual object queries produces more robust region proposals than the baseline with randomly initialized object queries. (b) This enhanced robustness leads to the superior DGSS performance of our tqdm on these classes. The red colors visualize the differences in IoU between the baseline and tqdm.
  • ...and 9 more figures
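
The fixed matching mentioned in the Figure 3 caption, where each query is always paired with the ground truth of one class, can be sketched as follows. This is an illustrative assumption rather than the paper's training code: the helper names `fixed_matching_targets` and `per_query_bce` are hypothetical, and plain binary cross-entropy stands in for whatever mask loss the authors actually use.

```python
import numpy as np


def fixed_matching_targets(label_map, num_classes):
    """Turn an (H, W) label map into (K, H, W) binary masks.

    Under fixed matching, target mask k always belongs to query k,
    so no assignment search between predictions and targets is needed.
    """
    class_ids = np.arange(num_classes).reshape(-1, 1, 1)
    return (label_map[None, :, :] == class_ids).astype(np.float64)


def per_query_bce(mask_logits, targets, eps=1e-7):
    """Binary cross-entropy per query (a stand-in loss, assumed here)."""
    probs = 1.0 / (1.0 + np.exp(-mask_logits))
    probs = np.clip(probs, eps, 1.0 - eps)
    bce = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return bce.mean(axis=(1, 2))


# Toy ground truth with K = 3 classes on a 2 x 3 image.
label_map = np.array([[0, 0, 1],
                      [2, 1, 1]])
targets = fixed_matching_targets(label_map, num_classes=3)

# Confident logits that agree with the targets, query k predicting class k.
good_logits = 10.0 * (2.0 * targets - 1.0)
# The same masks emitted by the wrong queries (query 0 predicts class 1, ...).
shuffled_logits = good_logits[[1, 2, 0]]

loss_good = per_query_bce(good_logits, targets)
loss_shuffled = per_query_bce(shuffled_logits, targets)
```

In this toy setup the shuffled predictions are penalized even though the set of masks is identical, which is the sense in which fixed matching ties each query to the semantics of a single class.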