FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn

TL;DR

FOCUS tackles fine-grained VQA on high-resolution images with a training-free, efficient cropping strategy that uses MLLM-internal KV-cache representations to locate the target objects mentioned in the prompt. It constructs an object relevance map from cached token similarities, ranks ROI proposals, verifies object existence, and performs VQA on the top region, achieving strong accuracy with substantially less compute than baselines such as ZoomEye. Across multiple datasets (V*Bench, HRBench, MME-RealWorld-Lite) and MLLMs (global-view and global-local-view), FOCUS demonstrates favorable accuracy-efficiency trade-offs and robustness to hyperparameter changes, supported by ablations and qualitative analyses. The approach offers practical benefits for high-resolution visual reasoning and suggests broader potential for inference-time spatial localization in multimodal systems.

Abstract

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3–6.5× less compute.
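
The second step, the object relevance map, can be illustrated with a minimal NumPy sketch. It assumes that one cached vector per image token and one cached vector for the target-object token have already been extracted from the model; which layer and which part of the KV cache to use is not specified here, so the inputs below are random stand-ins. The function name `relevance_map` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def relevance_map(image_tokens, target_token, grid_hw):
    """Cosine similarity between one target vector and every image-token vector.

    image_tokens: (H*W, D) array, one cached vector per image token (assumed layout).
    target_token: (D,) array, cached vector of the target-object token (assumed).
    grid_hw:      (H, W) patch grid that the image tokens are laid out on.
    """
    h, w = grid_hw
    if image_tokens.shape[0] != h * w:
        raise ValueError("number of image tokens must equal H * W")
    eps = 1e-8  # guards against division by zero for all-zero vectors
    img = image_tokens / (np.linalg.norm(image_tokens, axis=1, keepdims=True) + eps)
    tgt = target_token / (np.linalg.norm(target_token) + eps)
    # One dot product per image token, reshaped back onto the spatial grid.
    return (img @ tgt).reshape(h, w)

# Toy data: 16 random "image tokens" on a 4x4 grid; the target equals token 5.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
target = tokens[5].copy()
rel = relevance_map(tokens, target, (4, 4))
peak = np.unravel_index(np.argmax(rel), rel.shape)
```

In this toy setup the map peaks at grid cell (1, 1), the position of token 5, because that token is identical to the target vector. With real cached representations the peak would only indicate a likely location of the target object.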

Paper Structure

This paper contains 28 sections, 5 equations, 11 figures, 16 tables, 1 algorithm.

Figures (11)

  • Figure 1: Many VQA datasets focus on large instead of tiny objects. This figure shows the relative area of question-relevant objects w.r.t. the image. V*Bench contains various tiny VQA-relevant objects.
  • Figure 2: Overview of FOCUS. The method identifies the target objects mentioned in the question (I) and constructs their object relevance maps using cosine similarity between cached tokens (II + III). Then, it proposes regions of interest and ranks those by the existence confidence of the target object in each region (IV + V). Finally, the selected region is used to perform VQA (VI).
  • Figure 3: FOCUS is at the Pareto front on fine-grained VQA benchmarks. Given the same computation budget, FOCUS (purple crosses) significantly outperforms other visual cropping methods, on three different datasets and for two model architectures. It achieves $3\,\text{--}\,6.5 \times$ higher efficiency than the best-performing baseline ZoomEye. Note that we show only a limited set of data points for each method to ensure a clear visualization. The full numerical results are available in the appendix.
  • Figure 3: Ablation studies of FOCUS. We evaluate the influences of design choices of our method based on accuracy and recall. "rel." is short for "relevance".
  • Figure 4: Results on VQA datasets with large objects. We find only minor performance degradation of FOCUS w.r.t. the base model.
  • ...and 6 more figures
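
The region proposal and ranking steps named in the Figure 2 caption (IV + V) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: proposals are fixed-size windows centred on successive peaks of the relevance map, and the existence confidence is supplied by a caller-provided function. In FOCUS that confidence comes from the MLLM itself; here `confidence_fn` is a hypothetical stand-in, as are the names `propose_regions` and `rank_regions`.

```python
import numpy as np

def propose_regions(rel_map, window, top_k):
    """Greedy peak picking on a relevance map.

    Returns up to top_k boxes (top, left, bottom, right) in grid coordinates.
    The fixed window size and the suppression scheme are assumptions of this sketch.
    """
    h, w = rel_map.shape
    wh, ww = window
    if wh > h or ww > w:
        raise ValueError("window must fit inside the relevance map")
    work = rel_map.astype(float).copy()
    boxes = []
    for _ in range(top_k):
        if not np.isfinite(work).any():
            break  # every cell is already covered by a proposal
        r, c = np.unravel_index(np.argmax(work), work.shape)
        # Centre the window on the peak, clamped to stay inside the map.
        top = int(min(max(r - wh // 2, 0), h - wh))
        left = int(min(max(c - ww // 2, 0), w - ww))
        boxes.append((top, left, top + wh, left + ww))
        work[top:top + wh, left:left + ww] = -np.inf  # suppress the chosen area
    return boxes

def rank_regions(boxes, confidence_fn):
    """Sort proposals by descending existence confidence."""
    return sorted(boxes, key=confidence_fn, reverse=True)

# Toy relevance map with a strong peak at (1, 1) and a weaker one at (4, 4).
rel = np.zeros((6, 6))
rel[1, 1] = 0.9
rel[4, 4] = 0.5
boxes = propose_regions(rel, window=(2, 2), top_k=2)

# Stand-in confidences; a real system would query the MLLM for each crop.
fake_conf = {(0, 0, 2, 2): 0.2, (3, 3, 5, 5): 0.8}
ranked = rank_regions(boxes, lambda box: fake_conf[box])
```

The toy confidences deliberately reverse the relevance order, which shows why a separate ranking step can matter: the region with the highest relevance peak is not necessarily the one in which the target object is confirmed to exist.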