Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro

TL;DR

Graph-of-Mark (GoM) is the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. It consistently improves the zero-shot capability of multimodal language models (MLMs) in interpreting object positions and relative directions.

Abstract

Recent advances in training-free visual prompting, such as Set-of-Mark, have emerged as a promising direction for enhancing the grounding capabilities of multimodal language models (MLMs). These techniques operate by partitioning the input image into object regions and annotating them with marks, predominantly boxes with numeric identifiers, before feeding the augmented image to the MLM. However, these approaches treat marked objects as isolated entities, failing to capture the relationships between them. On these premises, we propose Graph-of-Mark (GoM), the first pixel-level visual prompting technique that overlays scene graphs onto the input image for spatial reasoning tasks. We evaluate GoM across 3 open-source MLMs and 4 different datasets, conducting extensive ablations on drawn components and investigating the impact of auxiliary graph descriptions in the text prompt. Our results demonstrate that GoM consistently improves the zero-shot capability of MLMs in interpreting object positions and relative directions, improving base accuracy in visual question answering and localization by up to 11 percentage points.
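The idea of marking object regions and then connecting them with relation edges can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the `Mark` class, the center-offset heuristic in `spatial_relation`, the nearest-pairs-first edge selection, and the SVG output are all hypothetical choices made for this example (the paper draws onto the image pixels, and its detector, relation extractor, and rendering style are not described in this section).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mark:
    """Hypothetical object region: numeric identifier plus a pixel bounding box."""
    ident: int
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def spatial_relation(a, b):
    """Relation of a with respect to b, judged by the dominant center offset.

    Assumed heuristic. Image coordinates: x grows rightwards, y grows downwards.
    """
    (ax, ay), (bx, by) = a.center, b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def build_scene_graph(marks, max_edges=None):
    """Return (subject_id, relation, object_id) triples, nearest pairs first."""
    def dist2(pair):
        (ax, ay), (bx, by) = pair[0].center, pair[1].center
        return (ax - bx) ** 2 + (ay - by) ** 2

    pairs = sorted(combinations(marks, 2), key=dist2)
    if max_edges is not None:
        pairs = pairs[:max_edges]  # max_edges=0 leaves marks only, no relations
    return [(a.ident, spatial_relation(a, b), b.ident) for a, b in pairs]


def render_overlay_svg(marks, edges, width, height):
    """Emit an SVG layer with boxes, numeric labels, and labeled relation lines."""
    by_id = {m.ident: m for m in marks}
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for m in marks:
        parts.append(
            f'<rect x="{m.x0}" y="{m.y0}" width="{m.x1 - m.x0}" '
            f'height="{m.y1 - m.y0}" fill="none" stroke="red"/>'
        )
        parts.append(f'<text x="{m.x0 + 2}" y="{m.y0 + 12}" fill="red">{m.ident}</text>')
    for s, rel, o in edges:
        (sx, sy), (ox, oy) = by_id[s].center, by_id[o].center
        parts.append(f'<line x1="{sx}" y1="{sy}" x2="{ox}" y2="{oy}" stroke="blue"/>')
        parts.append(f'<text x="{(sx + ox) / 2}" y="{(sy + oy) / 2}" fill="blue">{rel}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# Toy boxes standing in for detector output.
marks = [Mark(1, 10, 40, 60, 90), Mark(2, 120, 45, 170, 95), Mark(3, 20, 150, 70, 200)]
edges = build_scene_graph(marks)
svg = render_overlay_svg(marks, edges, width=200, height=220)
```

In this sketch, the overlay would be composited on top of the original image before it is passed to the MLM. Capping `max_edges` at zero reduces the overlay to boxes with numeric identifiers, which is the marks-only setting that the graph edges are meant to extend.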

Paper Structure (32 sections, 1 equation, 8 figures, 2 tables, 5 algorithms)

Figures (8)

  • Figure 1: Illustration of GoM. A multimodal language model is prompted by anchoring the input image in scene graphs expressing spatial object relations that are relevant to solving the task query provided by the user.
  • Figure 2: Qualitative example illustrating the impact of image preprocessing on VQA performance. The same question from VQA-v2 is posed to Qwen2.5-7B using 6 different hard visual prompts, highlighting how pixel transformations can influence the model's responses. For figure readability, the font size and line thickness have been increased compared to their actual values. Gray boxes denote baseline outputs, while blue boxes indicate those from our proposed GoM. See icon legend in Table \ref{tab:main_results}.
  • Figure 3: Effect of graph density. Performance of GoM on VQA-v2 as a function of the number of edges in the visual scene graph. $0$ edges correspond to SoM-like prompting.
  • Figure 4: Accuracy impact of augmenting the visual ($I$) and textual ($T$) prompts with scene graphs (denoted by the SG superscript). Gemma-3 results. Proposed GoM solutions have $I^\text{SG}$.
  • Figure 5: GoM prompt template used in the Visual SG Only condition.
  • ...and 3 more figures