LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

Soumyaratna Debnath, Bui Duc Manh, Zinan Liu, Lin Wang

Abstract

Vision-Language Models (VLMs) typically assume uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision even to uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is a Bio-inspired Adaptive Sampling Strategy (BASS), which enables a Möbius-parameterized module to perform non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.

Paper Structure

This paper contains 15 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Illustration of the underlying principle of our Bio-inspired Adaptive Sampling Strategy (BASS). (b) Performance comparison at 5% and 3% pixel budgets using Qwen2.5-VL \cite{bai2025qwen2} across datasets. Project page: https://empactlab.github.io/LLMind-CVPR-2026/
  • Figure 2: Qualitative comparison with Qwen2.5-VL on Seed-Bench at 5% pixel budget. LLMind adaptively allocates resolution to semantically important regions, preserving visual evidence critical for answering the question. Zoom in for a better view.
  • Figure 3: Overview of the proposed framework. Given the input image $I$, the MLP network predicts the Möbius transformation coefficients, which are used by BASS (Sec. \ref{sec:bass}) to produce the sampled image $\hat{I}$. The perceptual loss $\mathcal{L}_{\text{img}}$ (Eq. \ref{eq:img_loss}) between $I$ and $\hat{I}$ flows through the network to optimize the MLP parameters. In parallel, a frozen VLM processes a set of questions $q$ to generate predicted answers $y_{\text{pred}}$. These are compared with the ground-truth answers $y_{\text{gt}}$ to obtain the semantic loss $\mathcal{L}_{\text{text}}$ (Eq. \ref{eq:text_loss}), which further guides the optimization of the MLP parameters using SPSA (Eq. \ref{eq:spsa}).
  • Figure 4: Illustration of the Bio-inspired Adaptive Sampling Strategy (BASS). Given an input image $I$, the MLP predicts Möbius parameters $\theta$ to warp the image toward salient regions (Eq. \ref{eq:mobius}). The warped image is uniformly sampled under pixel budget $B$ through $\mathcal{S}_B(\cdot)$, and then reconstructed to its original resolution via an interpolation operator $\mathcal{I}(\cdot)$. Finally, the inverse transformation (Eq. \ref{eq:inv_mobius}) restores the global spatial structure, yielding the adaptively sampled image $\hat{I}$.
  • Figure 5: Illustration of the compared sampling methods under 10% pixel budget. Zoom in for a better view.
  • ...and 2 more figures
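
The warp and inverse-warp steps named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the textbook Möbius form $f(z) = (az + b)/(cz + d)$ with $ad - bc \neq 0$ acting on pixel coordinates treated as complex numbers; the paper's own parameterization (its Möbius and inverse-Möbius equations) is not reproduced in this section and may differ. The function names, the coordinate normalization to $[-1, 1]$, the parameter values, and the `budget_side` helper are all hypothetical and chosen only for illustration.

```python
import math


def mobius(z, a, b, c, d):
    """Apply f(z) = (a*z + b) / (c*z + d); requires a*d - b*c != 0."""
    if abs(a * d - b * c) < 1e-12:
        raise ValueError("degenerate Mobius map: a*d - b*c must be nonzero")
    return (a * z + b) / (c * z + d)


def mobius_inverse(w, a, b, c, d):
    """Apply the inverse map f^{-1}(w) = (d*w - b) / (-c*w + a)."""
    return mobius(w, d, -b, -c, a)


def pixel_grid(h, w):
    """Pixel centres as complex numbers x + iy spanning [-1, 1] per axis.

    Assumes h >= 2 and w >= 2 (normalization chosen for this sketch only).
    """
    xs = [-1 + 2 * j / (w - 1) for j in range(w)]
    ys = [-1 + 2 * i / (h - 1) for i in range(h)]
    return [[complex(x, y) for x in xs] for y in ys]


def budget_side(h, w, budget):
    """Grid size keeping roughly a `budget` fraction of the h*w pixels.

    Each side is scaled by sqrt(budget) so the pixel count scales by budget.
    """
    scale = math.sqrt(budget)
    return max(1, round(h * scale)), max(1, round(w * scale))


# Hypothetical parameters standing in for the MLP-predicted theta.
theta = dict(a=1.0, b=0.2 + 0.1j, c=0.1, d=1.0)

grid = pixel_grid(4, 4)
warped = [[mobius(z, **theta) for z in row] for row in grid]
restored = [[mobius_inverse(w, **theta) for w in row] for row in warped]

# Largest round-trip deviation between original and restored coordinates.
max_error = max(
    abs(r - z)
    for restored_row, grid_row in zip(restored, grid)
    for r, z in zip(restored_row, grid_row)
)
```

In this sketch the round trip recovers the original coordinates up to floating-point error, which is the property the inverse transformation relies on to restore global spatial structure. The `budget_side` helper only shows how a fractional pixel budget translates into a uniform sampling grid; it does not implement the sampling operator $\mathcal{S}_B(\cdot)$ or the interpolation operator $\mathcal{I}(\cdot)$ themselves.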