Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception

Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu

TL;DR

SD-RPN tackles the challenge of achieving fine-grained perception in multimodal LLMs without costly supervision or full-model fine-tuning. It denoises internal cross-modal attention to create high-quality pseudo-RoI labels and trains a lightweight Region Proposal Network on frozen backbones to predict RoIs in a single forward pass, thereby decoupling localization from autoregressive generation. The approach yields data-efficient, cross-model gains across multiple MLLM families and benchmarks (TextVQA, DocVQA, V-Star), including substantial improvements with as few as 10K training samples. Overall, SD-RPN offers a practical, annotation-free pathway to scalable high-resolution perception in MLLMs with favorable efficiency characteristics and broad applicability.

Abstract

Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
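The abstract's idea of turning a noisy attention map into pseudo-RoI labels while "resolving ambiguity" can be illustrated with a small sketch. This is not the authors' implementation: the quantile thresholds, the `IGNORE` marker, and the helper names `quantile`, `make_pseudo_labels`, and `masked_bce` are all assumptions made for illustration. The sketch labels high-attention tokens as foreground, low-attention tokens as background, leaves the uncertain middle band unlabeled, and then excludes that band from a binary cross-entropy loss of the kind an RPN head might be trained with.

```python
import math

IGNORE = -1  # hypothetical marker: ambiguous tokens that contribute no loss


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def make_pseudo_labels(attn, fg_q=0.9, bg_q=0.5):
    """Map a 2D attention grid to labels: 1 = FG, 0 = BG, IGNORE = ambiguous.

    The quantile thresholds are illustrative assumptions, not values
    taken from the paper.
    """
    flat = [a for row in attn for a in row]
    fg_thr = quantile(flat, fg_q)
    bg_thr = quantile(flat, bg_q)
    labels = []
    for row in attn:
        out = []
        for a in row:
            if a >= fg_thr:
                out.append(1)        # confident foreground
            elif a <= bg_thr:
                out.append(0)        # confident background
            else:
                out.append(IGNORE)   # ambiguous: left unlabeled
        labels.append(out)
    return labels


def masked_bce(logits, labels):
    """Mean binary cross-entropy over labeled tokens only (IGNORE skipped)."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            if y == IGNORE:
                continue
            # Numerically stable BCE-with-logits: max(z, 0) - z*y + log(1 + e^-|z|)
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count if count else 0.0


attn = [[0.0, 0.1], [0.2, 0.9]]
labels = make_pseudo_labels(attn)
loss = masked_bce([[0.0, 0.0], [0.0, 0.0]], labels)
```

In this toy grid the token with attention 0.2 falls between the two thresholds, so it is ignored and its logit has no effect on the loss. Any real pipeline would need the denoising steps the paper describes before thresholding.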

Paper Structure

This paper contains 28 sections, 15 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) The pipeline of SD-RPN. The RPN is trained with pseudo-labels to effectively predict RoIs. These RoIs are then used to crop fine-grained sub-images for the final inference stage. (b) Performance comparison. Performance evaluation against S$^2$ (shi2024we) and ViCrop (zhang2025mllms) on the LLaVA-1.5-7B baseline. Accuracy is averaged over five document and OCR benchmarks. Our SD-RPN achieves a superior trade-off between performance and throughput.
  • Figure 2: An overview of our pseudo-label generation pipeline. FG and BG denote the foreground and background, respectively. The layer index is omitted for simplicity.
  • Figure 3: Attention magnitude vs. localization accuracy.
  • Figure 4: Overview of our SD-RPN framework. Our lightweight RPN (top) is initialized from and built upon a frozen MLLM backbone to efficiently predict a dense RoI map. It is trained via self-distillation (bottom), where pseudo-labels are generated by denoising the full MLLM's internal response-to-image attention maps. Superscripts denote layer indices; subscripts denote token sources. We omit the system prompt tokens for brevity.
  • Figure 5: Performance-Throughput Trade-off on the V-Star Benchmark. Each point on the plot corresponds to a different maximum number of visual tokens. Our approach achieves a superior trade-off. The x-axis is on a logarithmic scale for clarity.
  • ...and 1 more figure