Boltzmann Attention Sampling for Image Analysis with Small Objects
Theodore Zhao, Sid Kiblawi, Naoto Usuyama, Ho Hin Lee, Sam Preston, Hoifung Poon, Mu Wei
TL;DR
BoltzFormer tackles the challenge of segmenting extremely small objects in images by employing dynamic sparse attention via Boltzmann sampling, guided by an annealing temperature schedule that explores broadly in early layers and focuses in later ones. It integrates a text-conditioned prior, an ensemble of latent queries, and a PiGMA-based mask aggregation to enable end-to-end, text-prompted segmentation with reduced attention computation. Across multiple biomedical and medical imaging datasets, BoltzFormer achieves substantial Dice-score gains (3–12 absolute points) over state-of-the-art promptable decoders while reducing attention computation by an order of magnitude, with pronounced benefits for objects smaller than 1% of image area. This work advances practical end-to-end segmentation for small, uncertain targets and suggests broad applicability in biomedical and vision tasks requiring precise localization of tiny structures.
Abstract
Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited to small objects whose size and location are variable and uncertain. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. In early layers, when uncertainty about object location is greatest, a higher temperature allows sampling over a broader area. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.
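The idea of sampling attention locations from a Boltzmann distribution under a decreasing temperature can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`boltzmann_probs`, `sample_patches`, `annealed_temperatures`), the geometric decay schedule, the temperature endpoints, and the toy relevance scores are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Boltzmann (softmax) distribution over patches: p_i ∝ exp(score_i / T)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_patches(scores, temperature, num_samples, rng):
    """Draw patch indices with replacement and return the sorted unique set."""
    probs = boltzmann_probs(scores, temperature)
    drawn = rng.choices(range(len(scores)), weights=probs, k=num_samples)
    return sorted(set(drawn))


def annealed_temperatures(t_start, t_end, num_layers):
    """Geometric decay from t_start to t_end (an assumed schedule)."""
    if num_layers == 1:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (num_layers - 1))
    return [t_start * ratio ** i for i in range(num_layers)]


rng = random.Random(0)

# Hypothetical relevance scores for 64 patches; patches 10 and 11 stand in
# for a small object, every other patch is background.
scores = [0.0] * 64
scores[10] = 3.0
scores[11] = 2.0

for layer, temp in enumerate(annealed_temperatures(4.0, 0.25, 4)):
    picked = sample_patches(scores, temp, num_samples=16, rng=rng)
    # Only the sampled patches would take part in attention at this layer.
    print(layer, round(temp, 3), len(picked))
```

At a high temperature the distribution is close to uniform, so the samples spread across the image; as the temperature falls, probability mass concentrates on the highest-scoring patches, and attention is computed over a small, focused subset of locations.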
