Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation
Philip Hughes, Larry Burns, Luke Adams
TL;DR
The paper addresses the challenge of generalizing semantic segmentation to diverse scenes and unseen categories by using large language models to generate context-sensitive, fine-grained descriptors. LangSeg fuses an image encoder (ViT/CNN), a language encoder (GPT-3/BERT), and a decoder within a generative framework conditioned on the image $I$ and the language input $L$, optimized with a total loss $\mathcal{L}_{total}$ that combines $\mathcal{L}_{gen}$, $\mathcal{L}_{triplet}$, $\mathcal{L}_{seg}$, and $\mathcal{L}_{multi-scale}$. The approach achieves state-of-the-art performance on ADE20K and COCO-Stuff, with $\text{mIoU}$ improvements of up to $6.1\%$ and strong results in human evaluations, while ablation studies validate the importance of language guidance and multi-scale learning. These findings highlight the practical potential for interactive and domain-specific segmentation tasks, with future work aimed at prompt optimization and efficiency improvements to broaden real-time applicability.
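This summary does not say how the four loss terms are combined. A common choice, assumed here purely for illustration, is a weighted sum in which the non-negative weights $\lambda_{gen}$, $\lambda_{triplet}$, $\lambda_{seg}$, and $\lambda_{ms}$ are hypothetical hyperparameters that do not appear in the text above:

$$
\mathcal{L}_{total} = \lambda_{gen}\,\mathcal{L}_{gen} + \lambda_{triplet}\,\mathcal{L}_{triplet} + \lambda_{seg}\,\mathcal{L}_{seg} + \lambda_{ms}\,\mathcal{L}_{multi-scale}
$$

Under this assumption, setting all weights to $1$ gives a plain sum, and setting a weight to $0$ removes the corresponding term, which is one way ablations of language guidance or multi-scale learning could be expressed.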
Abstract
Semantic segmentation plays a crucial role in enabling machines to understand and interpret visual scenes at a pixel level. While traditional segmentation methods have achieved remarkable success, their generalization to diverse scenes and unseen object categories remains limited. Recent advancements in large language models (LLMs) offer a promising avenue for bridging visual and textual modalities, providing a deeper understanding of semantic relationships. In this paper, we propose LangSeg, a novel LLM-guided semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors generated by LLMs. Our framework integrates these descriptors with a pre-trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining. We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms state-of-the-art models, achieving up to a 6.1% improvement in mean Intersection over Union (mIoU). Additionally, we conduct a comprehensive ablation study and human evaluation to validate the effectiveness of our method in real-world scenarios. The results demonstrate that LangSeg not only excels in semantic understanding and contextual alignment but also provides a flexible and efficient framework for language-guided segmentation tasks. This approach opens up new possibilities for interactive and domain-specific segmentation applications.
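Since the headline result is reported in mIoU, the sketch below illustrates how that metric is typically computed. It is a minimal, dependency-free illustration and not the paper's evaluation code. The function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example. Benchmark protocols usually accumulate intersections and unions over the whole dataset and may exclude an ignore label.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two flat sequences of class labels.

    pred and target hold one integer class id per pixel (hypothetical layout).
    Classes that appear in neither sequence are skipped rather than counted as 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)

    # Average the per-class IoU values; return 0.0 if no class was present.
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six pixels and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5 (up to floating-point rounding). Whether a reported gain such as 6.1% is an absolute or a relative difference in this quantity depends on the evaluation protocol.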
