Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection
Jingyao Wang, Yiming Chen, Lingyu Si, Changwen Zheng
TL;DR
This work addresses the adaptation of Vision-Language Models (VLMs) to complex wide-area scenes, whose diverse and sparsely distributed content is difficult for existing pre-trained models. It introduces Hierarchical Coresets Selection (HCS), a plug-and-play, training-free mechanism that selects a small, interpretable set of regions by optimizing a four-factor importance function (utility, representativeness, robustness, synergy) and refining region partitions layer by layer. The authors provide theoretical guarantees that the loss on the selected coreset closely approximates the full-image loss and that the coreset affords tighter generalization bounds. They validate HCS with experiments on image classification and semantic segmentation across multiple VLM baselines, reporting consistent improvements with modest computational overhead. In practice, HCS enables rapid, scalable wide-area scene understanding without additional fine-tuning, improving robustness and generalization to unseen scenes across diverse domains.
Abstract
Scene understanding is one of the core tasks in computer vision, aiming to extract semantic information from images to identify objects, scene categories, and their interrelationships. Although advances in Vision-Language Models (VLMs) have driven progress in this field, existing VLMs still struggle to adapt to unseen complex wide-area scenes. To address these challenges, this paper proposes a Hierarchical Coresets Selection (HCS) mechanism to advance the adaptation of VLMs in complex wide-area scene understanding. HCS progressively refines the selected regions based on a proposed importance function with theoretical guarantees, which considers utility, representativeness, robustness, and synergy. Without requiring additional fine-tuning, HCS enables VLMs to rapidly understand unseen scenes at any scale using a minimal set of interpretable regions while mitigating insufficient feature density. HCS is a plug-and-play method that is compatible with any VLM. Experiments demonstrate that HCS achieves superior performance and universality in various tasks.
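To make the coarse-to-fine idea concrete, the following is a minimal sketch of hierarchical region selection over a feature map. It is not the authors' implementation: the abstract does not define the four factors, so the scoring terms below (feature norm for utility, cosine similarity to the global mean for representativeness, negative within-region variance for robustness, dissimilarity to already chosen regions for synergy) are hypothetical stand-ins. The quadrant split, the equal weights, and the names `hierarchical_select`, `importance`, and `split` are likewise assumptions made for illustration.

```python
import numpy as np

def _cos(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def split(region):
    # Split (r0, r1, c0, c1) into four quadrants; keep tiny regions as-is.
    r0, r1, c0, c1 = region
    if r1 - r0 < 2 or c1 - c0 < 2:
        return [region]
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    return [(r0, rm, c0, cm), (r0, rm, cm, c1),
            (rm, r1, c0, cm), (rm, r1, cm, c1)]

def region_mean(feat, region):
    r0, r1, c0, c1 = region
    return feat[r0:r1, c0:c1].reshape(-1, feat.shape[-1]).mean(axis=0)

def importance(feat, region, chosen_means, weights):
    # Hypothetical four-factor score; every term is an illustrative proxy.
    r0, r1, c0, c1 = region
    patch = feat[r0:r1, c0:c1].reshape(-1, feat.shape[-1])
    mu = patch.mean(axis=0)
    global_mu = feat.reshape(-1, feat.shape[-1]).mean(axis=0)
    utility = float(np.linalg.norm(patch, axis=1).mean())
    representativeness = _cos(mu, global_mu)
    robustness = -float(patch.var(axis=0).mean())
    # Synergy proxy: reward regions unlike those already chosen.
    synergy = 1.0 - max((_cos(mu, m) for m in chosen_means), default=0.0)
    w_u, w_rep, w_rob, w_syn = weights
    return (w_u * utility + w_rep * representativeness
            + w_rob * robustness + w_syn * synergy)

def hierarchical_select(feat, k=2, layers=2, weights=(1.0, 1.0, 1.0, 1.0)):
    # feat: (H, W, D) feature map, e.g. patch embeddings from a frozen VLM.
    h, w, _ = feat.shape
    kept = [(0, h, 0, w)]
    for _ in range(layers):
        # Refine: only regions kept at the previous layer are subdivided.
        candidates = [child for region in kept for child in split(region)]
        chosen, chosen_means = [], []
        # Greedy pick, rescoring each round so synergy sees prior choices.
        while candidates and len(chosen) < k:
            scores = [importance(feat, c, chosen_means, weights)
                      for c in candidates]
            best = candidates.pop(int(np.argmax(scores)))
            chosen.append(best)
            chosen_means.append(region_mean(feat, best))
        kept = chosen
    return kept

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))
print(hierarchical_select(features, k=2, layers=2))
```

The sketch only scores and selects regions. In a training-free setting such as the one described above, the selected regions would then be passed to the frozen VLM in place of the full image. Greedy selection is used here for simplicity and is not claimed to be the optimization procedure of the paper.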
