PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai

TL;DR

PartGLEE tackles the challenge of fine-grained, open-world perception by introducing a part-level foundation model that learns hierarchical relationships between objects and semantic parts through a Q-Former. By unifying training on abundant object-level data with limited part-level annotations and employing a joint loss with a restriction term, it enables top-down parsing of any object into its constituents. The approach yields state-of-the-art results on multiple part-level benchmarks while maintaining competitive object-level performance and demonstrates strong cross-dataset and cross-category generalization, including Segmentation in the Wild. These capabilities position PartGLEE as a foundation model for multi-granularity, region-level perception and a valuable contributor to downstream tasks and multi-modal LLMs that require detailed scene understanding.

Abstract

We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open-world scenario. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into its corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and obtains competitive results on object-level tasks. The proposed PartGLEE significantly enhances hierarchical modeling capabilities and part-level perception over our previous GLEE model. Further analysis indicates that the hierarchical cognitive ability of PartGLEE can facilitate detailed image comprehension for mLLMs. The model and code will be released at https://provencestar.github.io/PartGLEE-Vision/.

Paper Structure (23 sections, 6 equations, 15 figures, 16 tables)

This paper contains 23 sections, 6 equations, 15 figures, 16 tables.

Figures (15)

  • Figure 1: An illustrative example demonstrating image annotations at diverse granularities across multiple datasets. The annotations at hierarchical levels with corresponding relationships are depicted on the right side. Below is a visualization of our segmentation results at multiple granularities.
  • Figure 2: Framework of PartGLEE. The Q-Former takes each object query as input and outputs the corresponding part queries. These queries are then fed into the object decoder and the part decoder, respectively, to generate hierarchical predictions.
  • Figure 3: Matching mechanisms of PartGLEE. Two separate forward passes are performed on the same image to obtain hierarchical segmentation results.
  • Figure 4: Various designs for generating predictions at different hierarchies. In scheme (a), we utilize only a single decoder to generate predictions for both objects and parts. In scheme (b), two parallel pixel decoders are employed to generate feature maps at different levels, aiming to explore the effectiveness of feature maps at different granularities. In scheme (c), we use two independent decoders to generate predictions for objects and parts, respectively.
  • Figure I: Visualization of the effectiveness after adopting the Restriction Loss.
  • ...and 10 more figures