PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai
TL;DR
PartGLEE tackles the challenge of fine-grained, open-world perception by introducing a part-level foundation model that learns hierarchical relationships between objects and semantic parts through a Q-Former. By unifying training on abundant object-level data with limited part-level annotations and employing a joint loss with a restriction term, it enables top-down parsing of any object into its constituents. The approach yields state-of-the-art results on multiple part-level benchmarks while maintaining competitive object-level performance and demonstrates strong cross-dataset and cross-category generalization, including Segmentation in the Wild. These capabilities position PartGLEE as a foundation model for multi-granularity, region-level perception and a valuable contributor to downstream tasks and multi-modal LLMs that require detailed scene understanding.
Abstract
We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open-world scenario. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into its corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and obtains competitive results on object-level tasks. The proposed PartGLEE significantly enhances hierarchical modeling capabilities and part-level perception over our previous GLEE model. Further analysis indicates that the hierarchical cognitive ability of PartGLEE can facilitate detailed comprehension of images for mLLMs. The model and code will be released at https://provencestar.github.io/PartGLEE-Vision/ .
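To make the idea of parsing objects into parts more concrete, the following is a minimal NumPy sketch of a Q-Former-style step in which a shared set of parsing queries attends to each object query to produce that object's part queries. It is an illustration under stated assumptions, not the released PartGLEE implementation: the function name `parse_objects_into_parts`, the use of image tokens as extra attention memory, the single attention layer, and the omission of learned projections are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def parse_objects_into_parts(object_queries, parsing_queries, image_tokens):
    """Hypothetical single-layer sketch of object-to-part parsing.

    object_queries:  (N, D) one embedding per detected object
    parsing_queries: (K, D) shared queries, learned in a real model
    image_tokens:    (M, D) flattened image features (assumed extra memory)

    Returns:
      part_queries: (N, K, D) K part embeddings for each of the N objects
      parent_index: (N * K,)  index of the object each part belongs to
      attn:         (N, K, 1 + M) attention weights over each object's memory
    """
    n, d = object_queries.shape
    k = parsing_queries.shape[0]
    m = image_tokens.shape[0]

    # Per-object memory: the object's own query followed by the image tokens.
    tokens = np.broadcast_to(image_tokens, (n, m, d))
    memory = np.concatenate([object_queries[:, None, :], tokens], axis=1)

    # Scaled dot-product cross-attention (learned projections omitted).
    scores = np.einsum("kd,nmd->nkm", parsing_queries, memory) / np.sqrt(d)
    attn = softmax(scores, axis=-1)

    # Residual update: each parsing query becomes an object-specific part query.
    part_queries = parsing_queries[None, :, :] + attn @ memory

    # The object-part hierarchy is kept as a parent index for every part.
    parent_index = np.repeat(np.arange(n), k)
    return part_queries, parent_index, attn
```

In this sketch the hierarchy is explicit: every part query carries the index of the object it was derived from, so part-level predictions can be grouped under their parent object in a top-down manner.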
