Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

Mincheol Kwon, Minseung Lee, Seonga Choi, Miso Choi, Kyeong-Jin Oh, Hyunyoung Lee, Cheonyoung Park, Yongho Song, Seunghyun Park, Jinkyu Kim

Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is the Instruction-Region Alignment, which localizes relevant regions using both visual input and textual instructions. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves superior accuracy compared to existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.

Paper Structure (37 sections, 6 equations, 17 figures, 16 tables)

Figures (17)

  • Figure 1: Given an input instruction (e.g., "What are the average retirement savings for Americans?"), (a) a conventional approach aims to prune tokens based on attention weights; however, this method is often unreliable and can produce hallucinated outputs. (b) Our method, called PinPoint, first identifies instruction-relevant regions and then refines these regions to retain only the most relevant tokens, thereby improving the model’s factuality.
  • Figure 2: VQA performance increases with a larger proportion of instruction-relevant visual tokens. Model: LLaVA-NeXT. Data: InfographicVQA.
  • Figure 3: An overview of the proposed PinPoint architecture, which comprises two main stages: (1) Region Selection and (2) Region Refinement. Given a user instruction (e.g., “What symbol represents American identity?”), the Region Selection stage identifies adaptive top-$k$ instruction-relevant image regions (e.g., areas surrounding the Statue of Liberty) using region-level feature extraction via a sliding window approach. The Region Refinement stage then further processes these regions to extract fine-grained visual tokens, which are subsequently fed into a large language model (LLM) to generate the final answer (e.g., “The statue of liberty”).
  • Figure 4: Illustration of the intra-image contrastive loss. Region features required to answer the given instruction are treated as positives, while all other regions in the same image are treated as negatives. The loss function pulls the positive pair $(E^t, E^v_{\text{pos}})$ closer in the embedding space, while pushing away the negative pairs $(E^t, E^v_{\text{neg},i})$.
  • Figure 5: (a) Built upon existing VQA benchmarks, such as InfoVQA, SPDocVQA, and MPDocVQA, we construct a new dataset using a pipeline that provides annotations of instruction-relevant regions likely to contain answers and supporting evidence. (b) Examples of the annotated regions. More examples are available in the supplemental material.
  • ...and 12 more figures
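
The captions of Figures 3 and 4 describe three mechanisms: sliding-window region extraction, adaptive top-$k$ region selection, and an intra-image contrastive loss. The sketch below illustrates one way these pieces could fit together; it is not the authors' implementation. Everything not stated in the captions is an assumption made for illustration: the function names, the use of cosine similarity as the region score, the threshold-plus-cap rule standing in for "adaptive top-$k$", and the InfoNCE-style form of the loss with a temperature parameter.

```python
import math


def sliding_windows(height, width, win, stride):
    """Enumerate (top, left, bottom, right) boxes over an image grid.

    Assumption: square windows with a fixed stride; the captions only
    say that region features come from a sliding-window approach.
    """
    boxes = []
    for top in range(0, max(height - win, 0) + 1, stride):
        for left in range(0, max(width - win, 0) + 1, stride):
            boxes.append((top, left, min(top + win, height), min(left + win, width)))
    return boxes


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_regions(text_emb, region_embs, threshold=0.5, max_k=3):
    """Hypothetical adaptive top-k rule.

    Keeps at most max_k regions whose similarity to the instruction
    embedding E^t is at least `threshold`; if none qualifies, the single
    best-scoring region is kept so the model always sees something.
    """
    scores = [cosine(text_emb, r) for r in region_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = [i for i in order[:max_k] if scores[i] >= threshold]
    return kept or order[:1]


def intra_image_contrastive_loss(text_emb, region_embs, positive_ids, temperature=0.1):
    """InfoNCE-style loss over the regions of one image (assumed form).

    For each positive region p, the term is
    -log( exp(s_p / t) / sum_j exp(s_j / t) ), where s_j is the cosine
    similarity between E^t and region j. Terms are averaged over positives.
    """
    logits = [cosine(text_emb, r) / temperature for r in region_embs]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -sum(logits[i] - log_denom for i in positive_ids) / len(positive_ids)
```

In this toy setup, a region whose embedding points in the same direction as the instruction embedding is selected first, and labelling that region as the positive yields a lower loss than labelling an orthogonal region as the positive. This matches the pull-closer and push-away behavior described in the Figure 4 caption.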