Universal Segmentation at Arbitrary Granularity with Language Instruction

Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang

TL;DR

UniLSeg is a universal segmentation model that performs segmentation at any semantic level under the guidance of language instructions, surpassing both specialist and unified segmentation models.

Abstract

This paper aims to achieve universal segmentation at arbitrary semantic levels. Despite significant progress in recent years, specialist segmentation approaches are limited to specific tasks and data distributions. Retraining a new model to adapt to new scenarios or settings is expensive in both computation and time, which raises the demand for a versatile, universal segmentation model that can cater to various granularities. Although some attempts have been made to unify different segmentation tasks or to generalize to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve an accurate understanding of content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. To train UniLSeg, we reorganize a group of tasks from their original, diverse distributions into a unified data format, in which images paired with texts describing the segmentation targets are the input and the corresponding masks are the output. Combined with an automatic annotation engine for utilizing large amounts of unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and unified segmentation models.
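The unified data format described in the abstract (image plus text in, mask out) can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: `UnifiedSample`, `from_semantic_labels`, and `from_referring_expression` are not taken from the paper or its code, masks are plain nested lists rather than tensors, and using the class name as the text prompt is an assumption about how a semantic segmentation dataset might be mapped into this format.

```python
from dataclasses import dataclass
from typing import Dict, List

Mask = List[List[int]]  # binary mask: 1 marks pixels that belong to the target


@dataclass
class UnifiedSample:
    """Hypothetical unified record: image and text are input, mask is output."""
    image_id: str  # reference to the image (pixel data omitted in this sketch)
    text: str      # language instruction describing the segmentation target
    mask: Mask     # binary mask of that target


def from_semantic_labels(image_id: str,
                         label_map: List[List[int]],
                         class_names: Dict[int, str]) -> List[UnifiedSample]:
    """Split one semantic label map into one (image, text, mask) sample per class.

    Assumption: the class name itself serves as the text prompt. Class ids
    missing from class_names (e.g. background) are skipped.
    """
    present = sorted({c for row in label_map for c in row if c in class_names})
    samples = []
    for class_id in present:
        mask = [[1 if c == class_id else 0 for c in row] for row in label_map]
        samples.append(UnifiedSample(image_id, class_names[class_id], mask))
    return samples


def from_referring_expression(image_id: str, expression: str,
                              mask: Mask) -> UnifiedSample:
    """A referring-segmentation annotation already matches the format as-is."""
    return UnifiedSample(image_id, expression, mask)


# Toy 2x3 label map: 0 = background, 1 = "sky", 2 = "dog"
label_map = [[0, 0, 1],
             [0, 2, 1]]
samples = from_semantic_labels("img_0", label_map, {1: "sky", 2: "dog"})
referring = from_referring_expression(
    "img_0", "the dog in the bottom row", [[0, 0, 0], [0, 1, 0]])
```

In this sketch, a category-level annotation and a referring expression end up in the same record type, which is the property that would let a single model be trained on both; how the paper actually builds its prompts and stores its data may differ.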

Paper Structure

This paper contains 16 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Illustration of UniLSeg, which can segment images at any granularity or semantic level using language as the instruction. "Seg. Mask", "Lang.", and "Sem. Level" denote the segmentation masks, the corresponding language descriptions, and the semantic levels, respectively. The segmentation masks are shown in red or other colors. UniLSeg can take arbitrary text as input, whether a long, detailed description of an object or a short category name. With flexible expressions indicating the segmentation target, UniLSeg achieves excellent performance at various semantic levels, e.g., object parts, single or multiple instances, and the whole scene.
  • Figure 2: Pipeline of UniLSeg. It takes both images and the corresponding language prompts as input. With versatile language descriptions indicating the segmentation targets and full visual-linguistic interaction, UniLSeg can perform segmentation at any semantic granularity and tackle various tasks such as semantic segmentation (SS), part segmentation (PS), salient object detection (SOD), open-vocabulary segmentation (OVS), referring image segmentation (RIS), and referring video object segmentation (RVOS).
  • Figure 3: Composition of the training data. (a) shows the proportions of the supervised sources collected from different tasks. (b) shows the composition of the pseudo-labeled training sources.
  • Figure 4: Visualization of segmentation results for different tasks.
  • Figure 5: Effect of incorporating 20% and 100% of the SA-1B data into the training process under the pre-training and joint-training strategies.
  • ...and 1 more figure