PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
Mennatullah Siam
TL;DR
PixFoundation investigates whether pixel-level grounding supervision in vision-language models improves grounding and VQA or inadvertently degrades them. It introduces two paired benchmarks, PixMMVP and PixCV-Bench, along with a prompt-sensitivity analysis and an interpretability tool to study when grounding emerges with respect to the output tokens. The findings show that many pixel-level MLLMs trained with mask supervision lag behind simple baselines in both VQA and grounding, and that grounding can emerge in tokens that do not directly match the referring expression, prompting a reevaluation of current training recipes. The work provides public benchmarks and an interpretability framework to guide the development of pixel-level grounding without sacrificing language capabilities, with broad implications for robust, interpretable multi-modal systems.
Abstract
Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend is to train MLLMs with pixel-level grounding supervision, in the form of masks, on large-scale labelled data with specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We demonstrate that simple baselines that are not unified achieve performance that matches or surpasses some of the pixel-level MLLMs. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose a prompt sensitivity analysis on both the language and visual prompts tailored for the grounding task. More importantly, we study the research question: "When does grounding emerge in MLLMs with respect to the output tokens?" We propose an interpretability tool that can be plugged into any MLLM to study this question. We show that grounding does not necessarily coincide with the exact referring expression in the output, but can coincide with the object's parts, location, appearance, context, or state. Code and datasets are publicly available at https://msiam.github.io/PixFoundationSeries/.
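
To make the idea of paired evaluation concrete, the following is a minimal sketch of how per-sample VQA and grounding outcomes could be combined into a failure breakdown. It is an illustration under stated assumptions, not the benchmarks' actual protocol: the sample fields (`pred_answer`, `gt_answer`, `pred_mask`, `gt_mask`), the exact-match answer check, the function names, and the 0.5 IoU threshold are all hypothetical choices made for this example.

```python
from collections import Counter

def mask_iou(pred, gt):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = pred | gt
    # Two empty masks are treated as a perfect match (an assumption of this sketch).
    return len(pred & gt) / len(union) if union else 1.0

def paired_breakdown(samples, iou_threshold=0.5):
    """Count samples by joint VQA / grounding outcome.

    Each sample is a dict with hypothetical keys:
    pred_answer, gt_answer, pred_mask, gt_mask.
    """
    counts = Counter()
    for s in samples:
        # Exact-match answer scoring is an assumption; real benchmarks may differ.
        vqa_ok = s["pred_answer"] == s["gt_answer"]
        ground_ok = mask_iou(s["pred_mask"], s["gt_mask"]) >= iou_threshold
        if vqa_ok and ground_ok:
            counts["both_correct"] += 1
        elif vqa_ok:
            counts["vqa_only"] += 1        # right answer, wrong region
        elif ground_ok:
            counts["grounding_only"] += 1  # right region, wrong answer
        else:
            counts["both_wrong"] += 1
    return counts
```

Because every image carries both a question and a mask for the referred object, a breakdown of this kind separates models that answer correctly without localizing the object from models that localize it but still answer incorrectly.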
