LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool

TL;DR

LangHOPS addresses open-vocabulary object-part instance segmentation by grounding object–part hierarchies in language space and refining part queries with a Multimodal Large Language Model. The approach integrates language-grounded hierarchies with an MLLM-based parsing stage to produce adaptive, open-vocabulary part queries that drive a part decoder, alongside a strong object segmentation module. Across in-domain, cross-dataset, and zero-shot evaluations on PartImageNet, Pascal-Part-116, and ADE20K, LangHOPS achieves state-of-the-art performance and demonstrates clear object–part synergy, with significant gains when additional part-level data are available. The work highlights the potential of language-driven, cross-modal reasoning for fine-grained scene parsing, and points to future directions in efficiency and 3D extensions for broader applicability.

Abstract

We propose LangHOPS, the first Multimodal Large Language Model (MLLM)-based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS jointly detects and segments hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities and to link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy. The code will be released here.
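
To make the notion of an object-part hierarchy expressed in language more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the candidate categories are given as a plain mapping from object names to part names and that each part query is simply the object name followed by the part name. The category names, the `part_queries` helper, and the query template are all hypothetical.

```python
from typing import Dict, List

# Hypothetical object -> parts hierarchy; the category names are illustrative
# only and are not taken from any dataset's label set.
HIERARCHY: Dict[str, List[str]] = {
    "dog": ["head", "torso", "leg", "tail"],
    "car": ["wheel", "door", "window"],
}


def part_queries(hierarchy: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build one text query per (object, part) pair, e.g. "dog head".

    The "<object> <part>" template is an assumption made for this sketch.
    """
    return {
        obj: [f"{obj} {part}" for part in parts]
        for obj, parts in hierarchy.items()
    }


queries = part_queries(HIERARCHY)
print(queries["dog"])
```

In a pipeline of this kind, such text queries would typically be embedded by a text encoder and matched against visual features; that step is omitted here because the abstract does not specify it.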

Paper Structure

This paper contains 29 sections, 10 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Given a 2D image and user queries of candidate object-part categories, our method LangHOPS grounds the hierarchy between objects and parts in language space and subsequently leverages a Multimodal Large Language Model to break down the segmented objects into parts.
  • Figure 2: LangHOPS framework. The left block illustrates the overall architecture, with an image backbone, an object segmentation module, an object-part parser, and a part segmentation module. The right block illustrates the design of the object-part parser, consisting of a "Language-Grounded Hierarchies" module that embeds the object-part hierarchy in language space, and an "MLLM-based Parsing" module that produces the part queries for segmentation using an MLLM.
  • Figure 3: Qualitative results of part-level segmentation of LangHOPS and baselines.
  • Figure 4: Visualization of annotations in the PartImageNet and PascalPart116 datasets.
  • Figure 5: Failure cases of LangHOPS in the cross-dataset setting of PartImageNet+INS+PART (training) $\rightarrow$ PPS-116 (evaluation).