USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

Xiaoqi Wang; Wenbin He; Xiwei Xuan; Clint Sebastian; Jorge Piazentin Ono; Xin Li; Sima Behpour; Thang Doan; Liang Gou; Han Wei Shen; Liu Ren

TL;DR

The paper addresses open-vocabulary image segmentation by introducing the Universal Segment Embedding (USE) framework, which combines a data pipeline that automatically generates rich segment-text pairs with a lightweight embedding model that maps segments to a shared vision-language space. The data pipeline leverages multi-granularity captions, grounding, and SAM-based mask generation to produce extensive segment-text pairs without manual labeling, while the USE model fuses CLIP and DINOv2 features and is trained via a segment-text contrastive loss to produce discriminative segment embeddings. Empirical results on semantic and part segmentation benchmarks show that USE achieves state-of-the-art performance among open-vocabulary methods, with strong robustness to different data sources (COCO, VG) and ablation studies validating architectural choices such as backbone fusion and the inclusion of the CLS token. The work advances practical open-vocabulary segmentation by providing a scalable, zero-shot approach that also supports downstream tasks like querying and segment-based ranking, highlighting the value of data-centric strategies in vision-language grounding.
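The segment-text contrastive objective mentioned above can be illustrated with a small sketch. This summary does not give the exact loss, so the code assumes a symmetric InfoNCE form (as popularized by CLIP); the function names and the `temperature` value are hypothetical, and random arrays stand in for real segment and text embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def segment_text_contrastive_loss(segment_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch (an assumed form, not the paper's exact loss).

    Row i of segment_emb and row i of text_emb are treated as a matched pair;
    every other row in the batch acts as a negative.
    """
    s = l2_normalize(segment_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    loss_seg_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_seg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_seg_to_text + loss_text_to_seg)

# Toy usage: matched pairs should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
segments = rng.normal(size=(8, 16))
texts = segments + 0.01 * rng.normal(size=(8, 16))
aligned_loss = segment_text_contrastive_loss(segments, texts)
shuffled_loss = segment_text_contrastive_loss(segments, np.roll(texts, 1, axis=0))
```

Minimizing a loss of this kind pulls each segment embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what makes the resulting embeddings usable for text-defined classification.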

Abstract

The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. Recent vision foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper, we introduce the Universal Segment Embedding (USE) framework to address this challenge. The framework comprises two key components: 1) a data pipeline designed to efficiently curate a large number of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. The USE model not only supports open-vocabulary image segmentation but also facilitates other downstream tasks (e.g., querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks, we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods.
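Once segments and category names share an embedding space, classification and querying reduce to similarity lookups. The sketch below shows only that final matching step under stated assumptions: the embeddings are hand-written toy vectors, not outputs of the USE model or a real text encoder, and `classify_segments` and `rank_segments_for_query` are hypothetical helper names.

```python
import numpy as np

def _unit_rows(x, eps=1e-12):
    """Normalize rows to unit length (eps guards against zero vectors)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def classify_segments(segment_emb, text_emb, class_names):
    """Assign each segment the category whose text embedding is most similar."""
    sim = _unit_rows(segment_emb) @ _unit_rows(text_emb).T   # cosine similarity
    best = sim.argmax(axis=1)
    return [class_names[i] for i in best], sim

def rank_segments_for_query(segment_emb, query_emb):
    """Return segment indices ordered from most to least similar to the query."""
    sim = _unit_rows(segment_emb) @ _unit_rows(query_emb[None, :])[0]
    return np.argsort(-sim)

# Toy data: three category embeddings and two segment embeddings (placeholders).
class_names = ["cat", "dog", "wheel"]
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
segment_emb = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.2, 2.0]])

labels, similarity = classify_segments(segment_emb, text_emb, class_names)
order = rank_segments_for_query(segment_emb, text_emb[2])   # query: "wheel"
```

Because the category list is just a set of text embeddings, it can be swapped at inference time without retraining, which is the property the open-vocabulary setting relies on.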

Paper Structure (12 sections, 1 equation, 7 figures, 5 tables)

Introduction
Related Work
Method
USE Data Pipeline
USE Model
Experiments
Datasets
Implementation Details
Open-Vocabulary Semantic Segmentation
Open-Vocabulary Part Segmentation
Ablation Study
Conclusion

Figures (7)

Figure 1: The proposed framework consists of two major components: a) a data pipeline that generates segment-text pairs from image datasets and b) a segment embedding model.
Figure 2: Overview of our data pipeline, which automatically constructs segment-text pairs at different levels of granularity. We design a unified data pipeline that curates data from different types of data sources while taking advantage of multiple foundation models to streamline the process.
Figure 3: Examples of a ground-truth caption, a caption generated with a commonly used image captioning prompt, and a caption generated with our prompt. Our prompt guides the MLLM to generate captions that mention more fine-grained object parts.
Figure 4: Examples of box-text pairs generated when the grounding model is queried with either the entire caption or individual noun phrases. Querying with the entire caption helps identify object parts accurately by taking more context into account.
Figure 5: Architecture of the USE model, which consists of: a) an image encoder that extracts image features for local patches and b) a segment embedding head that maps the image features to segment embeddings aligned with text descriptions. The USE model is trained with a segment-text contrastive loss using the segment and text embeddings.
...and 2 more figures
