Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

Jing Liu, Wenxuan Wang, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang

TL;DR

This work addresses the need to ground targets at multiple granularities by introducing the Multi-Granularity RES (MRES) task and the RefCOCOm benchmark, complemented by MRES-32M, a large-scale vision-language grounding dataset. It proposes UniRES++, a unified multimodal large language model that jointly handles object- and part-level RES via a Multi-Granularity Vision Flow, granularity-aware tokenization, and dynamic feature exploitation, achieving state-of-the-art results on classic RES, generalized RES, and MRES benchmarks. The approach demonstrates the value of cross-granularity data and interactive feature flows for fine-grained visual grounding, and the authors plan to publicly release RefCOCOm, MRES-32M, and UniRES++. The resulting single model localizes targets at the multi-object, single-object, and part levels in real-world scenes.

Abstract

Referring expression segmentation (RES) aims to segment the masks of the entities that match a descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single-object, or part-level references. This is challenging because users describe targets in diverse and nuanced ways. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, and they lack the data resources and unified frameworks needed for the more practical multi-grained RES. In this paper, we take a step further towards a RES task unified across visual granularities. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With a single shared model architecture and set of parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, and RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset, and UniRES++ model will be publicly available at https://github.com/Rubics-Xuan/MRES.

Paper Structure

This paper contains 26 sections, 15 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Classic referring expression segmentation (RES) and generalized RES (GRES) tasks only support expressions that indicate either a single target object (e.g., (a)) or multiple target objects (e.g., (b)), whereas our multi-granularity RES (MRES) task accommodates expressions that indicate specific part-level regions of object instances (e.g., (c)-(d) from our RefCOCOm benchmark). In addition, most previous grounding specialist models are designed exclusively for object-level grounding and can perform only RES or only GRES. In contrast, with a unified architecture, our UniRES++ can simultaneously handle multiple localization tasks across various granularity levels.
  • Figure 2: RefCOCOm benchmark statistics. (a) The number of referring expressions per part category, on a log scale. (b) A word cloud highlighting the head categories.
  • Figure 3: Selected samples from our proposed RefCOCOm benchmark for the multi-granularity RES (MRES) task.
  • Figure 4: Illustration of our data engine for constructing the MRES-32M dataset. (a) We begin by fine-tuning an MLLM to create a robust dense captioner capable of captioning at three granularity levels. (b) For object-level grounding data, we feed images and bounding boxes into the captioner and a powerful segmenter, generating captions and masks for various objects. (c) Leveraging the external knowledge of LLMs, we decompose existing object category annotations into a vocabulary of part-level tags, which are then processed by an open-vocabulary segmenter and our captioner to obtain part-level annotations.
  • Figure 5: MRES-32M dataset statistics. (a) The number of referring expressions per object category, on a log scale. (b) The number of referring expressions per part category, on a log scale. (c) A word cloud highlighting the head object categories. (d) A word cloud highlighting the head part categories.
  • ...and 6 more figures
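
The part-level branch of the data engine described in the Figure 4 caption, step (c), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PART_VOCAB` mapping is a hypothetical stand-in for the part vocabulary an LLM would produce, `toy_segmenter` and `toy_captioner` are hypothetical stubs for the open-vocabulary segmenter and the dense captioner, and a bounding box stands in for a real segmentation mask.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); a box stands in for a mask here

# Hypothetical part vocabulary; in the paper this comes from prompting an LLM.
PART_VOCAB: Dict[str, List[str]] = {
    "person": ["head", "arm", "leg"],
    "dog": ["head", "tail"],
}


@dataclass
class PartAnnotation:
    object_category: str
    part_tag: str
    mask: Box
    caption: str


def build_part_annotations(
    objects: List[Tuple[str, Box]],
    segment_part: Callable[[Box, str], Optional[Box]],
    caption_region: Callable[[Box, str], str],
) -> List[PartAnnotation]:
    """Expand each object into part tags, then segment and caption each part."""
    annotations: List[PartAnnotation] = []
    for category, box in objects:
        # Categories without a vocabulary entry yield no part-level data.
        for part in PART_VOCAB.get(category, []):
            region = segment_part(box, f"{category} {part}")
            if region is None:  # the segmenter found no such part
                continue
            annotations.append(
                PartAnnotation(category, part, region, caption_region(region, part))
            )
    return annotations


def toy_segmenter(box: Box, query: str) -> Optional[Box]:
    """Hypothetical stub: only 'head' queries succeed, returning the upper half of the box."""
    if not query.endswith("head"):
        return None
    x0, y0, x1, y1 = box
    return (x0, y0, x1, (y0 + y1) // 2)


def toy_captioner(region: Box, part: str) -> str:
    """Hypothetical stub: a real captioner would describe the image region."""
    return f"the {part} in region {region}"
```

Passing the segmenter and captioner in as callables keeps the vocabulary expansion separate from the models, so either stub could be swapped for a real model without changing the loop.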