High-Quality Entity Segmentation

Lu Qi, Jason Kuen, Weidong Guo, Tiancheng Shen, Jiuxiang Gu, Jiaya Jia, Zhe Lin, Ming-Hsuan Yang

TL;DR

This work targets open-world, high-resolution dense segmentation by introducing the EntitySeg dataset and CropFormer. EntitySeg provides a large, diverse, high-quality set of pixel-perfect masks across in-the-wild domains, emphasizing high-resolution imagery. CropFormer is a Transformer-based, multi-view fusion framework that jointly leverages full-image context and high-resolution crops through a novel association module and batch-level decoder, enabling effective fusion of predictions from multiple views. Together, they improve segmentation accuracy across entity, instance, panoptic, and semantic tasks, and demonstrate strong generalization to in-the-wild and high-resolution settings, with broad potential for image editing and open-world recognition tasks.

Abstract

Dense image segmentation tasks (e.g., semantic, panoptic) are useful for image editing, but existing methods generalize poorly in an in-the-wild setting with unrestricted image domains, classes, and variations in image resolution and quality. Motivated by these observations, we construct a new entity segmentation dataset with a strong focus on high-quality dense segmentation in the wild. The dataset contains images spanning diverse image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Given the high-quality, high-resolution nature of the dataset, we propose CropFormer, which is designed to tackle the intractability of instance-level segmentation on high-resolution images. It improves mask prediction by fusing the full image with high-resolution image crops that provide more fine-grained image details. CropFormer is the first query-based Transformer architecture that can effectively fuse mask predictions from multiple image views, by learning queries that associate the same entities across the full image and its crop. With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging entity segmentation task. Furthermore, CropFormer consistently improves the accuracy of traditional segmentation tasks and datasets. The dataset and code will be released at http://luqi.info/entityv2.github.io/.

Paper Structure (29 sections, 7 equations, 5 figures, 10 tables)


Figures (5)

  • Figure 1: Segmentation results on in-the-wild test images: (1) Mask2Former [cheng2022masked] trained for panoptic segmentation [kirillov2019panoptic] on the COCO dataset [lin2014microsoft]; (2) Mask2Former trained for entity segmentation [qi2021open] on the COCO dataset [lin2014microsoft]; (3) our CropFormer trained on the proposed EntitySeg dataset. For a fair comparison, all three models use a Swin-Large backbone [liu2021swin] and are trained until full convergence. Our approach provides far more desirable and useful results for many real-world applications.
  • Figure 2: High-quality mask annotations for low- and high-resolution images collected from existing datasets such as COCO [lin2014microsoft], ADE20K [zhou2017scene], and Cityscapes [cordts2016cityscapes], as well as from the internet. For images collected from these datasets, the middle and rightmost sub-figures compare the original annotations with ours; unannotated regions are shaded in black. The RGB and mask images shown here have been downsampled, resulting in some quality degradation relative to the originals.
  • Figure 3: Sub-figures (a) and (b) show the distributions of image resolution and average number of entities, respectively, for ADE20K, COCO, and EntitySeg. Sub-figure (c) shows the distribution of image sources from which the EntitySeg images were collected.
  • Figure 4: Framework of the proposed CropFormer. The red box indicates the cropped region, randomly sampled from four fixed image corners. In image-level prediction, the same entity may be assigned to different queries in different image views. Our association module and batch decoder associate the same entity across views with a single query.
  • Figure 5: Illustration of the image-level and batch-level decoders and the association module.
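
The Figure 4 caption says the crop is randomly sampled from four fixed image corners. The sketch below illustrates one way such sampling could work, and how a mask predicted on the crop could be placed back into full-image coordinates before any fusion. It is not the authors' implementation: the function names, the `ratio` parameter, and its default of 0.7 are assumptions made for illustration, since the text above does not state a crop size.

```python
import random

# The four fixed corners named in the Figure 4 caption.
CORNERS = ("top_left", "top_right", "bottom_left", "bottom_right")


def corner_crop_box(height, width, corner, ratio=0.7):
    """Return (top, left, bottom, right) of a crop anchored at `corner`.

    `ratio` is a hypothetical crop-to-image side ratio (an assumption;
    the source text does not give one).
    """
    if corner not in CORNERS:
        raise ValueError(f"unknown corner: {corner}")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    crop_h = max(1, int(round(height * ratio)))
    crop_w = max(1, int(round(width * ratio)))
    top = 0 if corner.startswith("top") else height - crop_h
    left = 0 if corner.endswith("left") else width - crop_w
    return top, left, top + crop_h, left + crop_w


def sample_corner_crop(height, width, ratio=0.7, rng=random):
    """Pick one of the four corners uniformly at random and return its box."""
    corner = rng.choice(CORNERS)
    return corner, corner_crop_box(height, width, corner, ratio)


def crop_image(image, box):
    """Cut `box` out of an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def paste_crop_mask(crop_mask, box, height, width):
    """Place a binary crop-level mask onto a zero canvas of the full image size."""
    top, left, _, _ = box
    full = [[0] * width for _ in range(height)]
    for r, row in enumerate(crop_mask):
        full[top + r][left:left + len(row)] = row
    return full
```

Anchoring crops at fixed corners keeps the mapping between crop and full-image coordinates a simple offset, which is what `paste_crop_mask` relies on. How the full-image and crop predictions are actually associated and fused is the job of the association module and batch-level decoder, which this sketch does not attempt to model.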