PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin

TL;DR

PixelLM tackles the challenge of pixel-level reasoning for open-set targets by integrating a lightweight pixel decoder with a segmentation codebook into a standard large multimodal model framework, eliminating reliance on external segmentation modules. It introduces a multi-scale token fusion mechanism and a target refinement loss to handle multiple targets with high mask quality. To support research, the authors build MUSE, a large, richly annotated multi-target segmentation benchmark generated via a GPT-4V-based pipeline. Empirically, PixelLM achieves state-of-the-art results on MUSE and multi-target referring segmentation while offering substantial efficiency gains, with ablations validating the contribution of each component.

Abstract

While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM is a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a target refinement loss to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
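To make the idea of producing masks from the hidden embeddings of the codebook tokens concrete, here is a minimal NumPy sketch. It is not the authors' decoder: the function name `masks_from_codebook`, the tensor shapes, the averaging of tokens within a scale, and the plain dot product against image features are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def masks_from_codebook(token_embeds, image_feats):
    """Illustrative sketch only (hypothetical helper, not the paper's decoder).

    token_embeds: array of shape (num_scales, tokens_per_scale, d), standing in
                  for the LLM hidden embeddings of the codebook tokens.
    image_feats:  list with one array per scale, each of shape (d, H, W).
    Returns a list of per-scale mask logit maps, each of shape (H, W).
    """
    logits = []
    for scale_tokens, feat in zip(token_embeds, image_feats):
        # Assumption: fuse the tokens of one scale by simple averaging.
        fused = scale_tokens.mean(axis=0)
        # Assumption: mask logits are the dot product between the fused
        # token embedding and the feature vector at every spatial location.
        logits.append(np.einsum("d,dhw->hw", fused, feat))
    return logits


# Example: two scales with two tokens each, embedding size 8.
tokens = np.ones((2, 2, 8))
feats = [np.ones((8, 4, 4)), np.ones((8, 8, 8))]
per_scale_logits = masks_from_codebook(tokens, feats)
# A positive logit corresponds to sigmoid(logit) > 0.5.
binary_masks = [m > 0 for m in per_scale_logits]
```

The sketch returns one logit map per scale; how the scales are combined into a final mask is left out because the abstract does not describe it.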

Paper Structure (25 sections, 8 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. We show its visualization results in the following scenarios: 1) multi-target reasoning segmentation; 2) instance-level segmentation tied to a text description; 3) multi-referring segmentation; 4) conversation.
  • Figure 2: Overview of the proposed PixelLM model architecture. (Left) Overall architecture. (Right) The proposed lightweight pixel decoder. Trainable LoRA parameters are incorporated into the LLM. All parameters except those for the CLIP encoder and LLM are trainable.
  • Figure 3: An example segmentation codebook comprising two scales with two tokens each. Each attention map results from the interaction between one token and its corresponding image feature in the decoder. The first two rows depict the token fusion mechanism, while the final row shows a failure case that arises when only one token is used.
  • Figure 4: The left panel illustrates the prompt employed in our GPT-4V data generation pipeline. The right panel showcases an example of the generated data.
  • Figure 5: Comparison between PixelLM and PixelLM$^\dagger$ (without the token fusion mechanism and target refinement loss).
  • ...and 3 more figures