LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation

Yang Miao; Jan-Nico Zaech; Xi Wang; Fabien Despinoy; Danda Pani Paudel; Luc Van Gool

TL;DR

LangHOPS addresses open-vocabulary object-part instance segmentation by grounding object–part hierarchies in language space and refining part queries with a Multimodal Large Language Model. The approach integrates language-grounded hierarchies with an MLLM-based parsing stage to produce adaptive, open-vocabulary part queries that drive a part decoder, alongside a strong object segmentation module. Across in-domain, cross-dataset, and zero-shot evaluations on PartImageNet, Pascal-Part-116, and ADE20K, LangHOPS achieves state-of-the-art performance and demonstrates clear object–part synergy, with significant gains when additional part-level data are available. The work highlights the potential of language-driven, cross-modal reasoning for fine-grained scene parsing, and points to future directions in efficiency and 3D extensions for broader applicability.

Abstract

We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS jointly detects and segments hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities and to link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) in-domain and 4.8% AP cross-dataset on the PartImageNet dataset, and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy. The code will be released.
