
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

Shizhan Gong, Minda Hu, Qiyuan Zhang, Chen Ma, Qi Dou

Abstract

Vision-language models (VLMs) have achieved remarkable success across diverse tasks. However, concerns about their trustworthiness persist, particularly regarding tendencies to lean more on textual cues than visual evidence and the risk of producing ungrounded or fabricated responses. To address these issues, we propose Saliency-R1, a framework for improving the interpretability and faithfulness of VLM reasoning. Specifically, we introduce a novel saliency map technique that efficiently highlights critical image regions contributing to generated tokens without additional computational overhead. This can further be extended to trace how visual information flows through the reasoning process to the final answers, revealing the alignment between the thinking process and the visual context. We use the overlap between the saliency maps and human-annotated bounding boxes as the reward function, and apply Group Relative Policy Optimization (GRPO) to align the salient regions with the critical regions, encouraging models to focus on relevant areas when conducting reasoning. Experiments show Saliency-R1 improves reasoning faithfulness, interpretability, and overall task performance.
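The abstract's reward signal can be sketched concretely. The paper does not give the exact overlap formula here, so the following is a minimal, hypothetical illustration that scores a saliency map by the fraction of its total mass falling inside the human-annotated bounding box; the function name and box convention `(x0, y0, x1, y1)` are assumptions for illustration only:

```python
import numpy as np

def saliency_alignment_reward(saliency: np.ndarray, bbox: tuple) -> float:
    """Hypothetical overlap reward: fraction of total saliency mass
    inside the annotated bounding box (x0, y0, x1, y1), pixel coords.
    1.0 = all saliency lies within the box; 0.0 = none of it does."""
    x0, y0, x1, y1 = bbox
    total = saliency.sum()
    if total <= 0:
        return 0.0  # degenerate map: no saliency anywhere
    inside = saliency[y0:y1, x0:x1].sum()
    return float(inside / total)

# Toy example: an 8x8 saliency map concentrated in the top-left quadrant.
sal = np.zeros((8, 8))
sal[0:4, 0:4] = 1.0
print(saliency_alignment_reward(sal, (0, 0, 4, 4)))  # → 1.0
```

In a GRPO setup, such a scalar would be computed per sampled response and combined with task-correctness rewards before computing group-relative advantages; the actual combination used by Saliency-R1 is specified in the paper body, not here.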

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Main motivation of this work. Different thinking processes might focus on distinct regions of an image, even if they arrive at the correct answer. Unfaithful thinking processes either focus on irrelevant parts of the image or fail to consider the image.
  • Figure 2: Overview of our method. (a) Illustration of saliency map techniques based on logits decomposition. (b) Illustration of attention rollout for generating saliency maps with thinking tokens as the bottleneck. (c) GRPO with saliency maps alignment reward.
  • Figure 3: Qualitative evaluation of interpretability. We present example responses and their corresponding saliency maps from the base model, the SFT-tuned model (Saliency-R1-CI), and saliency-R1. The ground-truth bounding box is highlighted in red. Due to space constraints, some nonessential parts of the model responses are omitted; the full versions are provided in the Appendix.
  • Figure 4: Ablation studies. Top: Average metrics on 9 VQA benchmarks. Bottom: Metrics on MME.
  • Figure 5: Additional examples of the saliency maps generated by our proposed saliency map techniques, and the corresponding questions and responses. The examples are generated using Saliency-R1-7B.