LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Soumyaratna Debnath; Bui Duc Manh; Zinan Liu; Lin Wang

Abstract

Vision-Language Models (VLMs) typically assume uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision even to uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is a Bio-inspired Adaptive Sampling Strategy (BASS): a Möbius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. Compared to uniform sampling under tight pixel budgets, LLMind achieves average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA. Notably, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. LLMind is also lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
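
To make the idea of Möbius-parameterized non-uniform sampling under a pixel budget concrete, the following is a minimal NumPy sketch, not the authors' implementation. It follows the warp, uniform sampling, reconstruction, and inverse-warp steps described in the Figure 4 caption, under several assumptions: the general Möbius form f(z) = (az + b)/(cz + d) with fixed, hand-picked coefficients (the paper predicts them with an MLP), pixel coordinates normalized to [-1, 1], nearest-neighbor reconstruction in place of the interpolation operator, and a coarse grid whose side lengths scale with the square root of the budget. The names `bass_sketch`, `to_index`, and `to_coord` are hypothetical.

```python
import numpy as np


def mobius(z, a, b, c, d):
    """General Möbius map f(z) = (a*z + b) / (c*z + d), with a*d - b*c != 0."""
    return (a * z + b) / (c * z + d)


def mobius_inverse(w, a, b, c, d):
    """Inverse of the map above: f^{-1}(w) = (d*w - b) / (-c*w + a)."""
    return (d * w - b) / (-c * w + a)


def to_index(coord, n):
    """Nearest index in range(n) for coordinates in [-1, 1]; outliers are clipped."""
    idx = np.rint((coord + 1.0) / 2.0 * (n - 1)).astype(int)
    return np.clip(idx, 0, n - 1)


def to_coord(idx, n):
    """Coordinate in [-1, 1] of index idx on an n-point uniform grid."""
    return 2.0 * idx / max(n - 1, 1) - 1.0


def bass_sketch(image, theta, budget):
    """Warp, sample uniformly on a coarse grid, reconstruct, and unwarp.

    Assumes the poles of the map and of its inverse lie outside [-1, 1]^2.
    """
    a, b, c, d = theta
    H, W = image.shape[:2]
    # Coarse grid with approximately budget * H * W samples (assumed sizing rule).
    h = max(1, int(round(H * np.sqrt(budget))))
    w = max(1, int(round(W * np.sqrt(budget))))

    # Normalized complex coordinate of every output pixel.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    u = mobius(xs + 1j * ys, a, b, c, d)  # position in the warped domain

    # Uniform sampling plus nearest-neighbor reconstruction: snap each warped
    # position to the closest point of the coarse h x w grid.
    u_snapped = to_coord(to_index(u.real, w), w) + 1j * to_coord(to_index(u.imag, h), h)

    # Inverse transformation: find which source pixel that coarse sample holds.
    z_src = mobius_inverse(u_snapped, a, b, c, d)
    return image[to_index(z_src.imag, H), to_index(z_src.real, W)]


# Usage with made-up parameters: this theta magnifies the left side of the image.
image = np.arange(64 * 64).reshape(64, 64)
sampled = bass_sketch(image, theta=(1.0, 0.2, 0.2, 1.0), budget=0.05)
```

Regions where the map stretches the image receive more of the coarse samples and therefore keep finer detail after the inverse warp, while the output retains the original resolution and layout. With identity parameters the sketch reduces to plain uniform subsampling.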

Paper Structure (15 sections, 10 equations, 7 figures, 3 tables)

Introduction
Related Works
Methodology
Bio-inspired Adaptive Sampling Strategy
Closed-Loop Semantic Feedback
Perceptual Loss.
Semantic Loss.
Experiments and Evaluation
Settings and Implementation Details
Baseline Methods
Results and Discussion
Comparison on Scene-level VQA.
Comparison on Region-guided VQA.
Ablation and Analysis
Conclusion and Future Work

Figures (7)

Figure 1: (a) Illustration of the underlying principle of our Bio-inspired Adaptive Sampling Strategy (BASS). (b) Performance comparison at 5% and 3% pixel budgets using Qwen2.5-VL \cite{bai2025qwen2} across datasets. Project page: https://empactlab.github.io/LLMind-CVPR-2026/
Figure 2: Qualitative comparison with Qwen2.5-VL on Seed-Bench at 5% pixel budget. LLMind adaptively allocates resolution to semantically important regions, preserving visual evidence critical for answering the question. Zoom in for a better view.
Figure 3: Overview of the proposed framework. Given the input image $I$, the MLP network predicts the Möbius transformation coefficients, which are used by BASS (Sec. \ref{sec:bass}) to produce the sampled image $\hat{I}$. The Perceptual loss $\mathcal{L}_{\text{img}}$ (Eq. \ref{eq:img_loss}) between $I$ and $\hat{I}$ flows through the network to optimize the MLP parameters. In parallel, a frozen VLM processes a set of questions $q$ to generate predicted answers $y_{\text{pred}}$. These are compared with the ground-truth answers $y_{\text{gt}}$ to obtain the Semantic loss $\mathcal{L}_{\text{text}}$ (Eq. \ref{eq:text_loss}) for further guiding the optimization of the MLP parameters using SPSA (Eq. \ref{eq:spsa}).
Figure 4: Illustration of the Bio-inspired Adaptive Sampling Strategy (BASS). Given an input image $I$, the MLP predicts Möbius parameters $\theta$ to warp the image toward salient regions (Eq. \ref{eq:mobius}). The warped image is uniformly sampled under pixel budget $B$ through $\mathcal{S}_B(\cdot)$, and then reconstructed to its original resolution via an interpolation operator $\mathcal{I}(\cdot)$. Finally, the inverse transformation (Eq. \ref{eq:inv_mobius}) restores the global spatial structure, yielding the adaptively sampled image $\hat{I}$.
Figure 5: Illustration of the compared sampling methods under 10% pixel budget. Zoom in for a better view.
...and 2 more figures
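
The Figure 3 caption states that the Semantic loss guides the MLP parameters through SPSA, which suits a frozen VLM because it needs only loss evaluations and no backpropagation through the model. The sketch below shows a standard SPSA (simultaneous perturbation stochastic approximation) step in NumPy; it is not the paper's exact update rule. The quadratic `toy_loss`, its target vector, and the fixed step size `lr` and perturbation scale `c` are assumptions made for illustration (textbook SPSA typically decays both gains over iterations).

```python
import numpy as np


def spsa_step(loss_fn, theta, lr, c, rng):
    """One SPSA update: two loss evaluations, no gradients from loss_fn."""
    # Random +/-1 (Rademacher) perturbation applied to all parameters at once.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    diff = loss_fn(theta + c * delta) - loss_fn(theta - c * delta)
    # For +/-1 entries, dividing by delta_i equals multiplying by delta_i.
    grad_est = diff / (2.0 * c) * delta
    return theta - lr * grad_est


def toy_loss(theta):
    """Stand-in for a black-box loss; the target vector is made up."""
    target = np.array([1.0, -2.0, 0.5])
    return float(np.sum((theta - target) ** 2))


rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(200):
    theta = spsa_step(toy_loss, theta, lr=0.05, c=0.01, rng=rng)
```

Each step costs two evaluations of the loss regardless of the number of parameters, which is the property that makes this family of estimators attractive when each evaluation is a forward pass through a frozen model.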
