PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang, Xiaochun Cao

TL;DR

This work introduces PhaseWin, a phase-window accelerated search for object-level attribution that replaces the quadratic-cost greedy region selection with a near-linear, coarse-to-fine procedure. By anchoring phases with high-gain regions, pruning low-potential candidates, and performing windowed fine-grained evaluation under dynamic supervision and annealing, PhaseWin closely tracks greedy performance while dramatically reducing model evaluations. The authors provide near-greedy theoretical guarantees under monotone submodular assumptions and demonstrate empirical gains across Grounding DINO and Florence-2 on COCO, LVIS, and RefCOCO, achieving over 95% of greedy faithfulness with roughly 20% of the computational budget. This approach shifts the efficiency-faithfulness frontier, enabling scalable, high-fidelity attribution for object-level multimodal models and broad applicability to image-based attribution tasks.

Abstract

Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
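To make the "quadratic-cost" claim concrete, the following is a minimal counting sketch and not the authors' code. It assumes that plain greedy search ranks all $n$ sub-regions and spends exactly one model forward pass per remaining candidate at every step. The function names `greedy_forward_count` and `budget_at_percent` are hypothetical helpers introduced only for illustration.

```python
def greedy_forward_count(num_regions: int) -> int:
    """Forward passes needed to rank every region with plain greedy search.

    Assumption: step t (0-indexed) scores each of the (num_regions - t)
    remaining regions once, so the total is n + (n - 1) + ... + 1 = n(n + 1) / 2.
    """
    return sum(num_regions - t for t in range(num_regions))


def budget_at_percent(num_regions: int, percent: int) -> int:
    """Forward passes available when only `percent` of the greedy budget is spent."""
    return greedy_forward_count(num_regions) * percent // 100


if __name__ == "__main__":
    for n in (25, 50, 100):
        full = greedy_forward_count(n)
        reduced = budget_at_percent(n, 20)
        print(f"n={n:3d}  greedy={full:5d}  20% budget={reduced:5d}")
```

Under these assumptions, 100 sub-regions cost 5050 forward passes with greedy search, so the roughly 20% budget quoted in the abstract corresponds to about 1010 passes.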

Paper Structure

This paper contains 34 sections, 3 theorems, 15 equations, 11 figures, 7 tables, 1 algorithm.

Key Result

Proposition 3.1

For maximizing a monotone submodular objective $\mathcal{F}:2^\mathcal{V}\to\mathbb{R}_+$ under a cardinality constraint $k$, let $S_{\mathrm{greedy}}$ denote the solution returned by the standard greedy algorithm and $S_{\mathrm{OPT}}$ denote the optimal subset of size $k$. Then the greedy algorithm satisfies $\mathcal{F}(S_{\mathrm{greedy}}) \geq (1 - 1/e)\,\mathcal{F}(S_{\mathrm{OPT}})$, and no polynomial-time algorithm can surpass this bound unless $P=NP$ (Nemhauser et al., 1978; Fujishige).
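The classical greedy guarantee for monotone submodular maximization, $\mathcal{F}(S_{\mathrm{greedy}}) \geq (1 - 1/e)\,\mathcal{F}(S_{\mathrm{OPT}})$, can be checked on a toy instance. The sketch below is illustrative only: it uses set coverage as a stand-in monotone submodular objective (the paper's attribution objective is different), and `TOY_SETS`, `coverage`, `greedy`, and `brute_force_opt` are hypothetical names chosen for this example.

```python
import math
from itertools import combinations

# Toy ground set: each candidate "region" covers some elements.
TOY_SETS = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}]


def coverage(sets, chosen):
    """Number of distinct elements covered; monotone and submodular."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)


def greedy(sets, k):
    """Standard greedy: repeatedly add the candidate with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(len(sets)) if i not in chosen]
        best = max(remaining, key=lambda i: coverage(sets, chosen + [i]))
        chosen.append(best)
    return chosen


def brute_force_opt(sets, k):
    """Optimal value over all size-k subsets (feasible only for tiny instances)."""
    return max(coverage(sets, c) for c in combinations(range(len(sets)), k))


if __name__ == "__main__":
    k = 2
    greedy_value = coverage(TOY_SETS, greedy(TOY_SETS, k))
    opt_value = brute_force_opt(TOY_SETS, k)
    bound = (1 - 1 / math.e) * opt_value
    print(greedy_value, opt_value, round(bound, 3))
    assert greedy_value >= bound
```

On this instance greedy first picks the largest set and ends with a coverage of 5, while the optimum (the two smaller sets) covers 6. Greedy is therefore suboptimal here but still well above the $(1 - 1/e) \cdot 6 \approx 3.79$ bound.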

Figures (11)

  • Figure 1: A. Comparison of model forward counts between VPS and PhaseWin (window size fixed at $16$) across different subregion numbers. B. Comparison of Insertion AUC and computational cost among representative methods, where PhaseWin achieves near-VPS faithfulness with a fraction of the computational budget.
  • Figure 2: PhaseWin Workflow. The algorithm alternates between (i) selecting an anchor region, (ii) pruning uninformative regions using fixed-ratio thresholds, and (iii) applying a windowed fine-grained selection with dynamic supervision.
  • Figure 3: Visualization of correct attribution cases on MS COCO, RefCOCO, and LVIS V1. Compared with ODAM and D-RISE, PhaseWin produces sharper and more faithful attributions. It matches or even exceeds VPS (Greedy) in Insertion AUC while requiring only $\sim$20% of its computational cost.
  • Figure 4: Trade-off between speed and precision.
  • Figure 5: Insertion AUC under Greedy (VPS). Grounding DINO is almost concave with only a few exceptions, while Florence-2 is completely convex.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Proposition 3.1
  • Theorem 3.1: Approximation Guarantee
  • Remark 3.1
  • proof
  • Definition F.1: Submodularity
  • Definition F.2: Supermodularity
  • Theorem F.1
  • proof: Sketch