Getting to the Point: Why Pointing Improves LVLMs

Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

Abstract

Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
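
To make the difference between the two approaches concrete, the following is a minimal sketch of how a Point-then-Count response could be parsed into coordinates and a final count. The output format (`(x, y)` pairs followed by `Count: N`), the regular expressions, and the function names are assumptions for illustration only; the paper's actual prompt and output format are not given in this excerpt.

```python
import re

# Hypothetical output format (an assumption, not the paper's):
# "(x, y)" pairs for each target object, followed by "Count: N".
POINT_PATTERN = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")
COUNT_PATTERN = re.compile(r"count\s*:\s*(\d+)", re.IGNORECASE)


def parse_point_then_count(response):
    """Return (points, stated_count) extracted from a model response."""
    points = [(float(x), float(y)) for x, y in POINT_PATTERN.findall(response)]
    match = COUNT_PATTERN.search(response)
    stated_count = int(match.group(1)) if match else None
    return points, stated_count


def is_consistent(points, stated_count):
    """Check that the stated count equals the number of predicted points."""
    return stated_count is not None and len(points) == stated_count


# A Direct Counting response would contain only the count, so it yields
# no points; a Point-then-Count response exposes the intermediate points.
response = "(12.5, 40.0) (55.0, 61.2) (80.1, 17.3) Count: 3"
points, count = parse_point_then_count(response)
```

Under this assumed format, the intermediate points can be inspected independently of the final answer, which is what makes them usable as visual explanations.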

Paper Structure (24 sections, 8 figures, 9 tables)

This paper contains 24 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Using counting as a case study, we (a) fine-tune four state-of-the-art LVLMs under two approaches, Direct Counting and Point-then-Count (PtC); (b) compare these approaches across settings. For PtC models, we assess whether predicted coordinates are grounded in the image and perform ablation studies to quantify their contribution.
  • Figure 2: F1-score as a function of the number of target objects under the ID setting for models fine-tuned with DC or PtC. While LLaVA-OneVision and InternVL3.5 show similar performance, Qwen2.5-VL models benefit more from point supervision, suggesting PtC as a more robust choice across object counts.
  • Figure 3: Accuracy (%) as a function of the number of distractors after fine-tuning on NoisyTR for DC or PtC. PtC maintains near-perfect accuracy for InternVL3.5 and Qwen2.5-VL 7B, while for Qwen2.5-VL 3B and LLaVA-OneVision PtC degrades faster than DC as distractors increase.
  • Figure 4: Cell-level F1-score (%) for each model across our $9\times9$ grid. Results are computed on images containing 9 distractors. LLaVA-OneVision shows a clear left-to-right performance drop, while Qwen2.5-VL models achieve a lower F1-score at the bottom and right edges. In contrast, InternVL3.5 maintains near-uniform performance across the image.
  • Figure 5: Prompt template used to perform PtC for Molmo.
  • ...and 3 more figures
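
The cell-level analysis described in the Figure 4 caption can be sketched as follows: points are binned into a $9\times9$ grid and an F1-score is computed per cell. This is only an illustration under stated assumptions. The matching rule used here (a predicted point counts as correct if a gold point falls in the same cell) and the names `cell_of` and `cell_f1` are hypothetical; the paper's actual matching criterion is not stated in this excerpt.

```python
from collections import Counter

GRID = 9  # the caption describes a 9x9 grid


def cell_of(point, width, height, grid=GRID):
    """Map an (x, y) pixel coordinate to a (row, col) grid cell."""
    x, y = point
    col = min(int(x / width * grid), grid - 1)   # clamp points on the right edge
    row = min(int(y / height * grid), grid - 1)  # clamp points on the bottom edge
    return row, col


def cell_f1(pred_points, gold_points, width, height, grid=GRID):
    """Per-cell F1, assuming same-cell matching (a simplification)."""
    pred = Counter(cell_of(p, width, height, grid) for p in pred_points)
    gold = Counter(cell_of(p, width, height, grid) for p in gold_points)
    scores = {}
    for cell in set(pred) | set(gold):
        tp = min(pred[cell], gold[cell])  # matched points in this cell
        fp = pred[cell] - tp              # predictions with no gold partner
        fn = gold[cell] - tp              # gold points that were missed
        scores[cell] = 2 * tp / (2 * tp + fp + fn)
    return scores


# Toy example on a 90x90 image, so each cell is 10x10 pixels.
gold = [(5, 5), (15, 5), (85, 85)]
pred = [(6, 4), (84, 86), (45, 45)]
scores = cell_f1(pred, gold, width=90, height=90)
```

Aggregating such per-cell scores over many images would give a heatmap in which row or column trends, such as a left-to-right drop, become visible.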