Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

Mincheol Kwon; Minseung Lee; Seonga Choi; Miso Choi; Kyeong-Jin Oh; Hyunyoung Lee; Cheonyoung Park; Yongho Song; Seunghyun Park; Jinkyu Kim

Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

Paper Structure (37 sections, 6 equations, 17 figures, 16 tables)

Introduction
Related Work
Large Vision-Language Models (LVLM)
Efficiency of Token Usage
Method
Region-Level Feature Extraction
Instruction-Region Alignment
Selecting and Refining Instruction-Relevant Regions
Region Selection.
Region Refinement.
Training with Contrastive Loss
PinPoint Dataset
Experiments
Datasets
Implementation and Evaluation Details
...and 22 more sections

Figures (17)

Figure 1: Given an input instruction (e.g., "What are the average retirement savings for Americans?"), (a) a conventional approach aims to prune tokens based on attention weights; however, this method is often unreliable and can produce hallucinated outputs. (b) Our method, called PinPoint, first identifies instruction-relevant regions and then refines these regions to retain only the most relevant tokens, thereby improving the model’s factuality.
Figure 2: VQA performance increases with a larger proportion of instruction-relevant visual tokens. Model: LLaVA-NeXT. Data: InfographicVQA.
Figure 3: An overview of the proposed PinPoint architecture, which comprises two main stages: (1) Region Selection and (2) Region Refinement. Given a user instruction (e.g., “What symbol represents American identity?”), the Region Selection stage identifies adaptive top-$k$ instruction-relevant image regions (e.g., areas surrounding the Statue of Liberty) using region-level feature extraction via a sliding window approach. The Region Refinement stage then further processes these regions to extract fine-grained visual tokens, which are subsequently fed into a large language model (LLM) to generate the final answer (e.g., “The statue of liberty”).
Figure 4: Illustration of the intra-image contrastive loss. Region features required to answer the given instruction are treated as positives, while all other regions are treated as negatives. The loss function pulls the positive pair $(E^t, E^v_{\text{pos}})$ closer in the embedding space while pushing the negative pairs $(E^t, E^v_{\text{neg},i})$ away.
Figure 5: (a) Built upon existing VQA benchmarks, such as InfoVQA, SPDocVQA, and MPDocVQA, we construct a new dataset using a pipeline that provides annotations of instruction-relevant regions likely to contain answers and supporting evidence. (b) Examples of the annotated regions. More examples are available in the supplemental material.
...and 12 more figures
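
The captions for Figures 3 and 4 describe two mechanisms: scoring sliding-window regions against the instruction and keeping the top-k, and a contrastive loss that separates positive from negative regions within one image. The sketch below illustrates both ideas in plain Python and is not the paper's implementation. Several choices are assumptions made only for illustration: mean pooling of patch features inside a window, cosine similarity as the alignment score, a fixed k rather than the adaptive top-k the paper mentions, an InfoNCE-style loss form, and the temperature `tau`. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sliding_windows(height, width, win, stride):
    """Yield (top, left) corners of win x win windows over a patch grid."""
    for top in range(0, height - win + 1, stride):
        for left in range(0, width - win + 1, stride):
            yield top, left

def region_feature(grid, top, left, win):
    """Mean-pool the patch features inside one window (assumed pooling)."""
    patches = [grid[r][c]
               for r in range(top, top + win)
               for c in range(left, left + win)]
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]

def select_regions(grid, text_emb, win=2, stride=1, k=1):
    """Score every window against the instruction embedding; keep the top k.

    Returns a list of (score, (top, left)) sorted by descending score.
    A fixed k is used here; the paper's adaptive rule is not reproduced.
    """
    h, w = len(grid), len(grid[0])
    scored = [(cosine(text_emb, region_feature(grid, t, l, win)), (t, l))
              for t, l in sliding_windows(h, w, win, stride)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

def intra_image_contrastive_loss(text_emb, pos_feats, neg_feats, tau=0.1):
    """InfoNCE-style loss (assumed form): positives vs. all regions of one image."""
    pos = sum(math.exp(cosine(text_emb, f) / tau) for f in pos_feats)
    neg = sum(math.exp(cosine(text_emb, f) / tau) for f in neg_feats)
    return -math.log(pos / (pos + neg))

# Toy 3x3 patch grid with 2-d features; the top-left 2x2 block matches the text.
A, B = [1.0, 0.0], [0.0, 1.0]
grid = [[A, A, B],
        [A, A, B],
        [B, B, B]]
text = [1.0, 0.0]
best = select_regions(grid, text, win=2, stride=1, k=1)
```

In this toy setup `best` holds the single window whose pooled feature is most similar to the instruction embedding. The loss is small when positive regions already align with the instruction and large when a negative region aligns better, which is the pull and push behaviour the Figure 4 caption describes.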
