LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu

TL;DR

LMSeg tackles open-vocabulary semantic segmentation by bridging language and pixel-level understanding through three innovations: (1) LLM-generated, attribute-rich prompts that describe each category by color, shape/size, and texture/material; (2) a learnable weighted fusion of CLIP visual features $F_c$ with SAM spatial features $F_s$, yielding $F = w F_c + (1-w) F_s$; and (3) spatial- and class-level feature enhancement, via a Swin Transformer and a linear Transformer, that produces refined cost maps for segmentation. The model computes a cost map $E$ from cosine similarities between visual features and text embeddings, $E_{(i,j,n)} = \frac{F_{(i,j)} \cdot T_n}{\|F_{(i,j)}\| \|T_n\|}$, with $E \in \mathbb{R}^{h \times w \times n}$ (here $h \times w$ is the spatial size of the feature map, not the fusion weight $w$, and $n$ is the number of categories), and is optimized with a binary cross-entropy loss $L_{bce}$. Across six open-vocabulary benchmarks, LMSeg achieves state-of-the-art results, with ablations showing that comprehensive prompts, weighted fusion, and feature enhancement each provide consistent gains. This work demonstrates that enriching language representations via LLMs and incorporating spatially aware visual encoders substantially improves pixel-precise open-vocabulary segmentation, with practical impact for real-world scene understanding and zero-shot recognition.
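
To make the fusion and cost-map formulas above concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes the CLIP and SAM features have already been projected to the same shape `(h, width, d)`, it treats the fusion weight `w` as a fixed scalar (in LMSeg it is learnable, and its parameterization is not given in this summary), and the function names `fuse_features` and `cost_map` are hypothetical.

```python
import numpy as np

def fuse_features(f_clip, f_sam, w):
    """Weighted fusion F = w * F_c + (1 - w) * F_s.

    Assumption: w is a plain scalar here; in the paper it is learned.
    """
    return w * f_clip + (1.0 - w) * f_sam

def cost_map(features, text_emb, eps=1e-8):
    """Cosine similarity between each pixel feature and each class embedding.

    features: (h, width, d) visual features F
    text_emb: (n, d) text embeddings T, one row per category
    returns:  (h, width, n) cost map E
    """
    # Normalize along the embedding dimension; eps guards against zero vectors.
    f_unit = features / (np.linalg.norm(features, axis=-1, keepdims=True) + eps)
    t_unit = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    # Dot product of every pixel vector with every class vector.
    return np.einsum("ijd,nd->ijn", f_unit, t_unit)

# Toy example with random features (illustrative sizes only).
rng = np.random.default_rng(0)
h, width, d, n = 4, 5, 8, 3
f_c = rng.normal(size=(h, width, d))   # stand-in for CLIP features
f_s = rng.normal(size=(h, width, d))   # stand-in for SAM features
t = rng.normal(size=(n, d))            # stand-in for text embeddings

F = fuse_features(f_c, f_s, w=0.6)
E = cost_map(F, t)
print(E.shape)
```

Each entry of `E` lies in $[-1, 1]$, and setting `w = 1.0` recovers the CLIP-only features, which is one way to read the weight as a trade-off between the two encoders.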

Abstract

It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Visualization of the cost map for different methods. The cost map represents the alignment between image and text features. The first row indicates the seen class 'person,' and the last two rows indicate the unseen classes 'bookcase' and 'sculpture.'
  • Figure 2: Illustration of our proposed LMSeg. Our method contains three parts: (a) comprehensive linguistic prompt generation, (b) visual feature extraction, and (c) feature enhancement. We first employ GPT-3.5 to generate comprehensive prompts for each category. Afterward, we fuse the SAM and CLIP visual features with a learnable weighted strategy. Finally, we perform feature enhancement at both the spatial level and the class level.
  • Figure 3: The pipeline for generating comprehensive linguistic prompts. We first ask the LLMs how to design prompts for semantic segmentation. Then, we use the answer to guide the LLMs in generating detailed descriptions for each category.
  • Figure 4: The flow of feature enhancement. We first perform spatial-level feature enhancement and then aggregate class-level features.
  • Figure 5: Qualitative comparisons on PC-59 (first two rows) and VOC (last two rows). From left to right: input images, results of CAT-Seg, results of our LMSeg, and ground truth.
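
The two-stage prompting pipeline described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `ask_llm` is a hypothetical callable that takes a prompt string and returns the model's reply (no real LLM API is called here), and the exact wording of both questions is invented for the example. Only the attribute list (color, shape/size, texture/material) comes from the text.

```python
from typing import Callable, Dict, List

# Attribute types named in the abstract.
ATTRIBUTES = ("color", "shape/size", "texture/material")

def build_category_prompts(
    categories: List[str],
    ask_llm: Callable[[str], str],
) -> Dict[str, str]:
    """Two-stage prompting: ask for design advice, then reuse it per category.

    ask_llm is a hypothetical stand-in for any LLM call (str -> str).
    """
    # Stage 1: ask the LLM how prompts for segmentation should be designed.
    # The question wording is an assumption made for this sketch.
    guidance = ask_llm(
        "How should text prompts be designed for semantic segmentation?"
    )

    prompts: Dict[str, str] = {}
    for name in categories:
        # Stage 2: feed the advice back and request an attribute-rich description.
        request = (
            f"{guidance}\n"
            f"Following this advice, describe the category '{name}' "
            f"in terms of its {', '.join(ATTRIBUTES)}."
        )
        prompts[name] = ask_llm(request)
    return prompts

def stub_llm(prompt: str) -> str:
    """Offline placeholder so the sketch runs without any external service."""
    return f"[stub reply to a {len(prompt)}-character prompt]"

descriptions = build_category_prompts(["bookcase", "sculpture"], stub_llm)
for category, text in descriptions.items():
    print(category, "->", text)
```

Passing the LLM in as a callable keeps the sketch independent of any particular provider; the resulting per-category descriptions would then be encoded by the text encoder in place of short template phrases.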