DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

Geng Li, Jinglin Xu, Yunzhen Zhao, Yuxin Peng

TL;DR

This work introduces DyFo, a training-free dynamic focus visual search framework that enhances fine-grained visual understanding in large multimodal models by enabling a cooperative loop between LMMs and visual experts via Monte Carlo Tree Search. The method employs a Focus Adjuster and a Focus Tree Search to dynamically select salient image regions without extra training or localization modules, aiming to reduce hallucinations in cluttered scenes. Empirical evaluations on POPE and V* Bench demonstrate consistent improvements over baselines, with ablations showing the value of the action space design and the LMM-visual expert collaboration. Overall, DyFo offers a plug-and-play solution to bolster high-resolution visual reasoning in LMMs while mitigating common failure modes like hallucination.

Abstract

Humans can effortlessly locate desired objects in cluttered environments, relying on a cognitive mechanism known as visual search to efficiently filter out irrelevant information and focus on task-related regions. Inspired by this process, we propose DyFo (Dynamic Focus), a training-free dynamic focusing visual search method that enhances fine-grained visual understanding in large multimodal models (LMMs). Unlike existing approaches that require additional modules or data collection, DyFo leverages a bidirectional interaction between LMMs and visual experts, using a Monte Carlo Tree Search (MCTS) algorithm to simulate human-like focus adjustments. This enables LMMs to focus on key visual regions while filtering out irrelevant content, without the additional training that vocabulary expansion or the integration of specialized localization modules would require. Experimental results demonstrate that DyFo significantly improves fine-grained visual understanding and reduces hallucination issues in LMMs, achieving superior performance across both fixed and dynamic resolution models. The code is available at https://github.com/PKU-ICST-MIPL/DyFo_CVPR2025
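
To make the idea of a tree search over focus regions concrete, the following is a minimal, generic UCT-style sketch. It is not the authors' implementation: the action set (zoom in, shift by half a box), the `reward_fn` callback (a stand-in for whatever feedback the LMM and visual expert would provide), and every function name here are assumptions chosen for illustration. The toy `iou` reward simply measures overlap with a known target box so the example can run on its own.

```python
import math
import random

def zoom_in(box):
    # Halve the width and height around the box center.
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    hw, hh = (x1 - x0) / 4, (y1 - y0) / 4
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def shift(dx, dy):
    # Build an action that moves the box by a fraction of its own size.
    def move(box):
        x0, y0, x1, y1 = box
        w, h = x1 - x0, y1 - y0
        return (x0 + dx * w, y0 + dy * h, x1 + dx * w, y1 + dy * h)
    return move

# Hypothetical action space; the paper's actual actions may differ.
ACTIONS = [zoom_in, shift(-0.5, 0), shift(0.5, 0), shift(0, -0.5), shift(0, 0.5)]

def clip(box, full):
    # Keep the focus box inside the full image.
    return (max(box[0], full[0]), max(box[1], full[1]),
            min(box[2], full[2]), min(box[3], full[3]))

def iou(a, b):
    # Toy reward: intersection over union of two boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class Node:
    def __init__(self, box, parent=None, depth=0, max_depth=4):
        self.box, self.parent, self.depth = box, parent, depth
        self.children, self.visits, self.value = [], 0, 0.0
        # Nodes at the depth limit are never expanded.
        self.untried = list(range(len(ACTIONS))) if depth < max_depth else []

def focus_tree_search(full, reward_fn, iterations=200, max_depth=4, c=1.4, seed=0):
    rng = random.Random(seed)
    root = Node(full, max_depth=max_depth)
    best_box, best_score = full, reward_fn(full)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes using UCB1.
        while not node.untried and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
        # Expansion: try one unexplored action from this node.
        if node.untried:
            a = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(clip(ACTIONS[a](node.box), full), node,
                         node.depth + 1, max_depth)
            node.children.append(child)
            node = child
        # Evaluation: score the focus region directly (no random rollout).
        score = reward_fn(node.box)
        if score > best_score:
            best_box, best_score = node.box, score
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    return best_box, best_score
```

Calling `focus_tree_search((0.0, 0.0, 100.0, 100.0), lambda b: iou(b, (60.0, 60.0, 80.0, 80.0)))` returns a focus box that overlaps the target better than the full image does. In a real system the reward would come from model feedback rather than a known target box.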

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An illustration of three mechanisms for large multimodal models (LMMs) in fine-grained visual understanding tasks.
  • Figure 2: An illustration of the DyFo framework, composed of the Focus Adjuster and Focus Tree Search (see the method section).
  • Figure 3: An illustration of Focus Adjuster of DyFo.
  • Figure 4: Comparison between the responses of LLaVA-v1.5, Qwen2-VL and our method DyFo on POPE cases. The final focus region is highlighted in the image using red bounding boxes.
  • Figure 5: Comparison between the responses of LLaVA-v1.5, Qwen2-VL and our method DyFo on several V* Bench cases. The final focus region is highlighted in the image using red bounding boxes.