PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

Mennatullah Siam

TL;DR

PixFoundation investigates whether pixel-level grounding supervision in vision-language models improves grounding and VQA or inadvertently degrades them. It introduces two paired benchmarks, PixMMVP and PixCV-Bench, along with a prompt-sensitivity analysis and an interpretability tool to study when grounding emerges with respect to the output tokens. The findings show that many pixel-level MLLMs trained with mask supervision lag behind simple baselines in both VQA and grounding, and that grounding can emerge in tokens that do not directly match the referring expression, prompting a reevaluation of current training recipes. The work provides public benchmarks and an interpretability framework to guide the development of pixel-level grounding without sacrificing language capabilities, with broad implications for robust, interpretable multi-modal systems.

Abstract

Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend is to train MLLMs with pixel-level grounding supervision, in the form of masks, on large-scale labelled data with specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even degrade grounding ability relative to MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We demonstrate that simple baselines that are not unified achieve performance that matches or surpasses some of the pixel-level MLLMs. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose a prompt sensitivity analysis on both the language and visual prompts tailored for the grounding task. More importantly, we study the research question "When does grounding emerge in MLLMs with respect to the output tokens?" We propose an interpretability tool that can be plugged into any MLLM to study this question. We show that grounding does not necessarily coincide with the exact referring expression in the output, but can coincide with the object's parts, location, appearance, context or state. Code and datasets are publicly available at https://msiam.github.io/PixFoundationSeries/.

Paper Structure

This paper contains 30 sections, 1 equation, 22 figures, 10 tables.

Figures (22)

  • Figure 1: Research questions we tackle: (i) the grounding & VQA ability of pixel-level MLLMs in challenging scenarios (top), (ii) when does grounding emerge in standard MLLMs with respect to the output tokens? (bottom). The latter shows the noun phrases and their corresponding predicted segmentation, highlighted in red. These are extracted from LLaVA 1.5 attention maps; three masks are shown because of the ambiguity of the point prompt, which is taken at the maximum attention and highlighted as a black circle.
  • Figure 2: Failures of pixel-level MLLMs. (a) The first failure is the degraded performance in visual question answering in some of these models. (b) The second, which relates to the first, is the degraded performance in instruction following: the question instructs the model to generate one letter from the options, but the model fails to do so. (c) The third is the degraded performance in pixel-level visual grounding in some of these models. The predicted segmentation masks corresponding to the [SEG] token(s) are highlighted in red.
  • Figure 3: Prompt sensitivity in visual grounding. Example showing the prompt sensitivity in relation to grounding, emphasizing the importance of language in visual grounding. The example uses Qwen2.5-VL. The prompt is (a) "Locate the dog's face and output all the coordinates in JSON format.", (b) "Locate the dog's face, output its bbox coordinates using JSON format."
  • Figure 4: Our interpretability mechanism showing concept categories where the grounding emerges in PixMMVP using LLaVA 1.5 (7B). Top: referring expression, output response, noun phrases and concepts corresponding to the grounding using the oracle selection. Bottom: the four images with the predicted segmentation mask, highlighted in red, using the oracle selection. The input point prompt is highlighted as a black circle. It shows the segmentation of the referring expression emerging in output noun phrases that differ from the original expression.
  • Figure 5: PixMMVP qualitative comparison in visual grounding following the second probing. The referring expression is on top. It shows that mining for grounding within the attention maps of standard MLLMs with oracle mask selection yields better grounding than MLLMs trained with mask supervision, without degrading their VQA abilities. This calls into question whether the current training recipes and design choices of pixel-level MLLMs fully utilize the potential of their base MLLMs.
  • ...and 17 more figures
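
The captions for Figures 1 and 4 mention a point prompt taken at the maximum of an attention map. The following is a minimal sketch of that single step, not the authors' implementation: the function name `attention_to_point_prompt`, the choice of the patch centre as the point, and the 24×24 grid on a 336×336 image used in the example are all assumptions made for illustration. How the attention map is obtained from the MLLM, and how the resulting point is turned into candidate masks, are outside the scope of this sketch.

```python
import numpy as np


def attention_to_point_prompt(attn, image_size):
    """Map the maximum of a 2D patch-attention map to a pixel (x, y) point.

    attn:       2D array of shape (rows, cols), one score per image patch.
    image_size: (width, height) of the image in pixels.
    Hypothetical helper for illustration only.
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2:
        raise ValueError("expected a 2D (rows, cols) attention map")
    rows, cols = attn.shape
    width, height = image_size

    # Row and column of the patch with the highest attention score.
    r, c = np.unravel_index(np.argmax(attn), attn.shape)

    # Use the centre of that patch as the point prompt, in pixel coordinates.
    x = (c + 0.5) * width / cols
    y = (r + 0.5) * height / rows
    return float(x), float(y)


if __name__ == "__main__":
    # Toy map: an assumed 24x24 patch grid over an assumed 336x336 image.
    attn = np.zeros((24, 24))
    attn[6, 18] = 1.0
    print(attention_to_point_prompt(attn, (336, 336)))
```

A single point is an ambiguous prompt (it may refer to a part, the whole object, or a larger region), which is consistent with the captions showing several candidate masks per point.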