Boltzmann Attention Sampling for Image Analysis with Small Objects

Theodore Zhao, Sid Kiblawi, Naoto Usuyama, Ho Hin Lee, Sam Preston, Hoifung Poon, Mu Wei

TL;DR

BoltzFormer segments extremely small objects in images using dynamic sparse attention via Boltzmann sampling, guided by an annealed temperature schedule that samples broadly in early layers and focuses in later ones. It integrates a text-conditioned prior, an ensemble of latent queries, and PiGMA-based mask aggregation to enable end-to-end, text-prompted segmentation with reduced attention computation. Across multiple biomedical imaging datasets, BoltzFormer achieves Dice-score gains of 3–12 absolute points over state-of-the-art promptable decoders while reducing attention computation by an order of magnitude, with the largest benefits for objects smaller than 1% of the image area. The approach makes end-to-end segmentation of small, uncertain targets more practical and may apply more broadly to biomedical and vision tasks that require precise localization of tiny structures.

Abstract

Detecting and segmenting small objects, such as lung nodules and tumor lesions, remains a critical challenge in image analysis. These objects often occupy less than 0.1% of an image, making traditional transformer architectures inefficient and prone to performance degradation due to redundant attention computations on irrelevant regions. Existing sparse attention mechanisms rely on rigid hierarchical structures, which are poorly suited for detecting small, variable, and uncertain object locations. In this paper, we propose BoltzFormer, a novel transformer-based architecture designed to address these challenges through dynamic sparse attention. BoltzFormer identifies and focuses attention on relevant areas by modeling uncertainty using a Boltzmann distribution with an annealing schedule. Initially, a higher temperature allows broader area sampling in early layers, when object location uncertainty is greatest. As the temperature decreases in later layers, attention becomes more focused, enhancing efficiency and accuracy. BoltzFormer seamlessly integrates into existing transformer architectures via a modular Boltzmann attention sampling mechanism. Comprehensive evaluations on benchmark datasets demonstrate that BoltzFormer significantly improves segmentation performance for small objects while reducing attention computation by an order of magnitude compared to previous state-of-the-art methods.

Paper Structure

This paper contains 37 sections, 10 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The overall architecture of BoltzFormer for end-to-end object detection and segmentation via a unified text prompt. BoltzFormer is a novel transformer-based architecture that introduces a Boltzmann attention sampling module to dynamically propose sparse areas to focus cross-attention in each layer using a Boltzmann distribution. To account for uncertainty, which is especially high in the earlier stages of computation, BoltzFormer starts with a high temperature in the first layer, which gradually cools down in subsequent layers. This is reminiscent of a reinforcement learning process, where exploration is favored in the initial layers (more sparse areas being sampled) and exploitation in later layers (focusing on a handful of the most promising areas). The model takes an image (upper left) and a text prompt (lower left) as input, and outputs a segmentation mask (upper right) for the object specified in the text prompt. Specifically, we use a standard image encoder to obtain multiscale visual features, including a high-resolution semantic map (upper middle). BoltzFormer starts with a set of latent queries that try to model the correct semantics for the prompted object in the image (middle left). In each layer, the query vectors are combined with the semantic map to produce a Boltzmann distribution over the image, which is then used to sample the sparse areas. Each query attends exclusively to the visual features in its sampled area and updates itself (see Fig. 2 for more details). The queries communicate with the text embeddings through self-attention after each Boltzmann attention sampling layer (center block). After the transformer layers, each query is combined with the image semantic map to generate a candidate predicted mask. The predictions are aggregated by a pixel grounded mask aggregation (PiGMA) module into the final mask prediction (upper right; see Fig. 3 for details).
  • Figure 2: Illustration of the Boltzmann attention sampling block (center block in Fig. 1). The latent queries from the previous layer each go through the MLP transformation (Eq. \ref{eq-mlp}) with the dimension kept constant. Each transformed query vector takes the dot product with all feature vectors on the semantic map, yielding a scalar score at each location on the map. We use a sigmoid to transform the scores into (0,1), and compute the Boltzmann distribution at temperature $\tau_\ell$ using Eq. \ref{eq-boltz}. We then draw from the distribution for $N$ trials with replacement to sample the corresponding patches in the visual features. The query attends exclusively to the sampled features and adds the result to itself. We perform the same for all query vectors and apply layer normalization to them at the end.
  • Figure 3: The architecture of the PiGMA module. The module takes in the predictions from the final-layer queries. The query ensemble prediction part (top row) simply averages the predictions and interpolates them to a higher resolution. The pixel grounded correction part (de)convolves the predictions twice into a higher resolution. In each convolution layer, we feed in the resized original image and concatenate it along the channel dimension. $c$ is the intermediate convolution dimension. Finally, the query ensemble prediction and the pixel grounded correction are averaged and passed through a sigmoid transformation to produce the pixel-wise probability mask prediction.
  • Figure 4: Examples of Boltzmann sampling from the intermediate layers during inference. For each image and text prompt, the queries attend only to the sampled patches at each layer. The dark region is completely masked out in that layer. Boundaries of the ground truth object are marked in red.
  • Figure 5: Boltzmann sampling example for a lung nodule in chest CT. The sampled patches are bright, with the target region circled in red.
  • ...and 4 more figures
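
The sampling step described in the Figure 2 caption can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation. The excerpt does not reproduce the Boltzmann equation or the temperature schedule, so the form p_i proportional to exp(s_i / tau), the geometric decay in `temperatures`, the projection-free `attend` function, and all helper names and toy values are assumptions made for the example. The MLP transformation and layer normalization are omitted, and the query is held fixed across layers to isolate the effect of temperature.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def boltzmann_probs(query, semantic_map, tau):
    """Per-patch sampling probabilities, ASSUMING p_i proportional to exp(s_i / tau)."""
    scores = [sigmoid(dot(query, feat)) for feat in semantic_map]  # each in (0, 1)
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_patches(probs, n_trials, rng):
    """Draw n_trials indices with replacement and keep the distinct ones."""
    drawn = rng.choices(range(len(probs)), weights=probs, k=n_trials)
    return sorted(set(drawn))


def attend(query, feats):
    """Residual dot-product attention over the sampled features only
    (simplification: no learned projections, no layer normalization)."""
    d = len(query)
    logits = [dot(query, f) / math.sqrt(d) for f in feats]
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    context = [sum(w * f[j] for w, f in zip(weights, feats)) / total for j in range(d)]
    return [q + c for q, c in zip(query, context)]


rng = random.Random(0)
dim, n_patches = 4, 64
# Toy semantic map: random background features plus one "object" patch.
semantic_map = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(n_patches)]
target = 10
semantic_map[target] = [2.0, 2.0, 2.0, 2.0]
query = [1.0, 1.0, 1.0, 1.0]  # stands in for an MLP-transformed latent query

# Hypothetical geometric annealing schedule: hot first layer, cold last layer.
temperatures = [1.0 * 0.3 ** layer for layer in range(4)]

peak_probs = []
for tau in temperatures:
    probs = boltzmann_probs(query, semantic_map, tau)
    patches = sample_patches(probs, n_trials=16, rng=rng)
    # The query is kept fixed across layers here; a real layer would carry
    # query_after forward to the next layer.
    query_after = attend(query, [semantic_map[i] for i in patches])
    peak_probs.append(max(probs))
    print(f"tau={tau:.3f}  distinct patches={len(patches)}  max prob={max(probs):.3f}")
```

Under the assumed form, the sigmoid bounds every score in (0,1), so at tau = 1 the probabilities of any two patches differ by at most a factor of e and sampling is close to uniform. Lowering tau sharpens the distribution around the highest-scoring patch, which mirrors the exploration-then-exploitation behavior the captions describe.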