FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn
TL;DR
FOCUS tackles fine-grained VQA on high-resolution images with a training-free, efficient cropping strategy that uses the MLLM's internal KV-cache representations to locate the target objects mentioned in the prompt. It constructs an object relevance map from cached token similarities, ranks ROI proposals, verifies object existence, and performs VQA on the top region, achieving strong accuracy with substantially less compute than baselines such as ZoomEye. Across multiple datasets (V*Bench, HRBench, MME-RealWorld-Lite) and MLLMs (global-view and global-local-view), FOCUS shows favorable accuracy-efficiency trade-offs and robustness to hyperparameter changes, supported by ablations and qualitative analyses. The approach offers practical benefits for high-resolution visual reasoning and suggests broader potential for inference-time spatial localization in multimodal systems.
Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3-6.5× less compute.
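To make the second and third steps more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that cached key vectors for the image tokens and for the target-object tokens are already available as NumPy arrays, that relevance is measured by cosine similarity averaged over the target tokens, and that candidate regions are fixed-size sliding windows over the token grid. The function names `relevance_map` and `rank_regions`, the window scheme, and the mean aggregation are illustrative assumptions; the abstract does not specify the layer selection, similarity measure, or proposal mechanism.

```python
import numpy as np

def relevance_map(image_keys, target_keys, grid_hw):
    """Return an (h, w) map of cosine similarity between image-token keys
    and target-token keys (assumed relevance measure, averaged over targets)."""
    h, w = grid_hw
    img = image_keys / np.linalg.norm(image_keys, axis=1, keepdims=True)
    tgt = target_keys / np.linalg.norm(target_keys, axis=1, keepdims=True)
    sim = img @ tgt.T                      # shape: (h * w, n_target_tokens)
    return sim.mean(axis=1).reshape(h, w)

def rank_regions(rel_map, win_hw, top_k=3):
    """Score every sliding-window position by its mean relevance and
    return the top_k entries as (score, top_row, left_col)."""
    h, w = rel_map.shape
    wh, ww = win_hw
    scored = []
    for r in range(h - wh + 1):
        for c in range(w - ww + 1):
            window = rel_map[r:r + wh, c:c + ww]
            scored.append((float(window.mean()), r, c))
    scored.sort(reverse=True)              # highest mean relevance first
    return scored[:top_k]

# Toy example: a 4x4 token grid whose lower-right 2x2 block matches the target.
dim = 8
keys = np.tile(np.eye(dim)[1], (16, 1))    # background tokens
for r in (2, 3):
    for c in (2, 3):
        keys[r * 4 + c] = np.eye(dim)[0]   # tokens aligned with the target
target = np.eye(dim)[[0]]                  # one target-object token

rel = relevance_map(keys, target, (4, 4))
best = rank_regions(rel, (2, 2), top_k=1)[0]
```

In this toy setup the best window starts at row 2, column 2, which is where the target-aligned tokens were placed. In a real pipeline, the top-ranked window would be mapped back to pixel coordinates and the crop passed to the MLLM for the final VQA step.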
