PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji

TL;DR

This work tackles fine-grained, part-level grounding in large multimodal models by introducing the Explanatory Part Segmentation task and the Partonomy benchmark, which together evaluate grounded part reasoning and explanations. It identifies two architectural flaws common to segmentation-enabled LMMs (distribution shift induced by [SEG] tokens, and the discarding of past mask predictions) and proposes PLUM, a span-tagging, mask-feedback LMM that avoids segmentation tokens and conditions on prior masks. PLUM achieves strong zero-shot part grounding, outperforms prior segmenting LMMs on reasoning segmentation and VQA benchmarks, and, once finetuned, approaches baselines trained on more segmentation data. The Partonomy/PLUM framework provides a foundation for interpretable, fine-grained visual grounding in multimodal AI and opens pathways for further work on grounded, compositional reasoning.

Abstract

Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining, which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
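
The abstract reports results in gIoU without defining it. In the reasoning-segmentation literature (e.g., LISA), gIoU usually denotes the mean of per-image IoU scores, while the related cIoU is the cumulative intersection divided by the cumulative union. The sketch below assumes PARTONOMY follows that convention; the function names, the list-of-rows mask format, and the choice to score an empty-vs-empty pair as 1.0 are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask given as rows of 0/1 values


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-shaped masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def giou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean of per-mask IoU. Empty-vs-empty pairs score 1.0 (assumption)."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative intersection over cumulative union across all masks."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Two toy 2x2 examples: the first prediction over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[0, 0], [0, 1]]]
gts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]

print(giou(preds, gts))  # per-mask IoUs are 0.5 and 1.0, so the mean is 0.75
print(ciou(preds, gts))  # 2 intersecting pixels over 3 union pixels
```

Because gIoU averages per-mask scores, small parts count as much as large ones, which matters for part-level evaluation where many target regions cover only a few pixels.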

Paper Structure

This paper contains 56 sections, 11 figures, 14 tables.

Figures (11)

  • Figure 1: The Partonomy dataset evaluates LMMs' part understanding through the Explanatory Part Segmentation task. Given an input image, a segmentation-enabled LMM selects a textual explanation and generates part segmentation masks which serve as textual and visual rationale for its answer choice. Our question-answer mutation framework generates challenging answer choices by predicting part co-occurrence and by selecting parts from confusable objects.
  • Figure 2: An example of PLUM's part understanding compared to recent segmenting LMMs trained on part data.
  • Figure 3: Overview of PLUM. PLUM does not depend on special tokens (e.g., <SEG>) added during finetuning to generate segmentation masks. Instead, it uses a bidirectional span extractor that automatically determines which tokens should be passed to the mask decoder to generate segmentations. A feedback loop based on SAM's mask decoder enables PLUM to condition future segmentations on past ones.
  • Figure 4: Performance (micro/macro gIoU) on Partonomy validation splits.
  • Figure 5: Ablations. (a) The feedback loop and tagging mechanism improve part segmentation on Partonomy-PartImageNet; (b) varying the KL-constraint weight $\lambda_{\text{KL}}$ trades off segmentation gIoU against TextVQA accuracy.
  • ...and 6 more figures
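
The span extractor described in the Figure 3 caption selects which generated tokens are handed to the mask decoder, in place of a dedicated segmentation token. One common way to realize span tagging is a per-token B/I/O labeling (begin, inside, outside) that is then decoded into contiguous spans. The sketch below illustrates only that decoding step; the BIO scheme, the function name `bio_to_spans`, and the example sentence are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode per-token B/I/O tags into half-open (start, end) token spans."""
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # a new span closes the previous one
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:  # lenient: a stray "I" opens a span
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:  # span that runs to the end of the sequence
        spans.append((start, len(tags)))
    return spans


# Hypothetical model output: text tokens plus one tag per token.
tokens = ["The", "plane", "has", "a", "spray", "rig", "and",
          "fixed", "landing", "gear", "."]
tags = ["O", "O", "O", "O", "B", "I", "O", "B", "I", "I", "O"]

spans = bio_to_spans(tags)
phrases = [" ".join(tokens[s:e]) for s, e in spans]
print(spans)    # [(4, 6), (7, 10)]
print(phrases)  # ['spray rig', 'fixed landing gear']
```

In a full system, each extracted span would be pooled into a query for the mask decoder, and each predicted mask would be fed back as conditioning for the next span; neither step is shown here.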