Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

Wenxuan Wang; Tongtian Yue; Yisi Zhang; Longteng Guo; Xingjian He; Xinlong Wang; Jing Liu

TL;DR

This paper proposes a new multi-granularity referring expression segmentation (MRES) task, constructs a manually annotated evaluation benchmark called RefCOCOm, and builds MRES-32M, the largest visual grounding dataset, which comprises over 32.2M high-quality masks and captions on the provided 1M images.

Abstract

Referring expression segmentation (RES) aims to segment the foreground masks of the entities that match a descriptive natural language expression. Previous datasets and methods for the classic RES task rely heavily on the prior assumption that an expression must refer to object-level targets. In this paper, we take a step further to the finer-grained part-level RES task. To advance the object-level RES task towards finer-grained vision-language understanding, we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm through manual annotation. By employing our automatic model-assisted data engine, we build the largest visual grounding dataset, MRES-32M, which comprises over 32.2M high-quality masks and captions on the provided 1M images. In addition, a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and on three datasets (i.e., RefCOCO(+/g)) for the classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the UniRES model will be publicly available at https://github.com/Rubics-Xuan/MRES
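To make the object-level versus part-level distinction concrete, the following is a minimal sketch of how a multi-granularity sample could be represented and grouped. It is an illustration only: the class `ReferringSample`, its field names, the helper `split_by_granularity`, and the example expressions are all hypothetical and do not reflect the released RefCOCOm or MRES-32M file formats.

```python
from dataclasses import dataclass
from typing import Dict, List, Literal

# Hypothetical labels for the two granularities named in the abstract.
Granularity = Literal["object", "part"]


@dataclass(frozen=True)
class ReferringSample:
    """One (expression, mask) pair; field names are assumptions."""
    image_id: str
    expression: str
    granularity: Granularity
    mask_id: str  # placeholder handle for a binary mask stored elsewhere


def split_by_granularity(
    samples: List[ReferringSample],
) -> Dict[str, List[ReferringSample]]:
    """Group samples so object-level and part-level sets can be handled separately."""
    groups: Dict[str, List[ReferringSample]] = {"object": [], "part": []}
    for sample in samples:
        groups[sample.granularity].append(sample)
    return groups


# Made-up examples: one object-level and two part-level expressions.
samples = [
    ReferringSample("img_0", "the man on the left", "object", "m0"),
    ReferringSample("img_0", "the left arm of the man on the left", "part", "m1"),
    ReferringSample("img_1", "the tail of the brown dog", "part", "m2"),
]
groups = split_by_granularity(samples)
```

Under this sketch, a unified model would consume both groups through the same interface, with the granularity label used only for bookkeeping.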

Paper Structure (25 sections, 9 figures, 8 tables)

This paper contains 25 sections, 9 figures, 8 tables.

Introduction
Related Work
Multi-Granularity Grounding Benchmark
Multi-Granularity RES Task
RefCOCOm Benchmark
Multi-Granularity Grounding Dataset
Data Collection Engine
MRES-32M Dataset Details
Multi-Granularity RES Model
Experimental Results
Datasets
Experimental Setup
Main Results
Multi-Granularity MRES Task
Classic Object-Level RES Task
...and 10 more sections

Figures (9)

Figure 1: Classic Referring Expression Segmentation (RES) only supports expressions that indicate a single target object, e.g., (a). Compared with classic RES, the proposed Multi-Granularity Referring Expression Segmentation (MRES) task supports expressions indicating the specific part-level regions of target objects, e.g., part-level expressions like (b)-(e) from our newly built RefCOCOm benchmark.
Figure 2: RefCOCOm benchmark statistics. (a) The number of referring expressions per part category, on a log scale. (b) A word cloud highlighting the head categories.
Figure 3: The illustration of our data engine for building the MRES-32M dataset. (a) We start by fine-tuning an LVLM to create a capable dense captioner, which can effectively handle captioning tasks at three levels of granularity. (b) To generate object-level grounding data, we feed images and original bounding boxes into the dense captioner and a powerful segmenter to obtain the captions and masks of various objects. (c) We leverage the external knowledge from LLMs to decompose the existing object category annotations into a vocabulary set of part-level tags, which are sequentially fed into an open-vocabulary segmenter and our captioner to acquire the part-level annotations.
Figure 4: The architecture of our UniRES model as a simple baseline for the MRES task. UniRES mainly comprises three parts: the visual and textual backbone for feature extraction, the pixel grouping design for aggregating the low-level and high-level features, and the cascaded two-stage vision-language decoder for multimodal feature fusion and the generation of segmentation masks.
Figure 5: Qualitative analysis for ablation study on the object-level and part-level grouping design in our model structure. (a) the input image. (b) low-level group tokens. (c) high-level group tokens.
...and 4 more figures
