A Fixed-Point Approach to Unified Prompt-Based Counting

Wei Lin, Antoni B. Chan

TL;DR

This paper establishes a comprehensive prompt-based counting framework that generates density maps for objects of interest indicated by various prompt types, such as box, point, and text. It does so by converting prompts from different modalities into prompt masks without requiring training.

Abstract

Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for objects of interest indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks without requiring training. These masks are then integrated into a class-agnostic counting methodology for predicting density maps. Furthermore, we introduce a fixed-point inference along with an associated loss function to improve counting accuracy, all without introducing new parameters. The effectiveness of this method is substantiated both theoretically and experimentally. Additionally, a contrastive training scheme is implemented to mitigate dataset bias inherent in current class-agnostic counting datasets, a strategy whose effectiveness is confirmed by our ablation study. Our model excels on prominent class-agnostic datasets and exhibits superior performance in cross-dataset adaptation tasks.

Paper Structure

This paper contains 16 sections, 23 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Overview of prompt-based counting. The model takes prompts in various modalities, e.g., box, point, or text annotations, to indicate the object of interest, and then predicts the distribution and count accordingly.
  • Figure 2: Our unified prompt-based counting framework. A CNN encoder generates image features, and a token is aggregated based on the provided prompt mask. Next, cross-attention is applied to generate density features, which are then decoded to produce the density map. Importantly, the density map can also be viewed as a prompt mask, implying the existence of a fixed point solution. This fixed point enables the utilization of a loop to enhance the output.
  • Figure 3: Comparison between cosine distance in (\ref{eq:cosmask}) and softmax strategy in (\ref{eq:sfxmask}) on text-prompt mask generation. The list shows the extracted concept dictionary via LLaMA-Adapter V2 and spaCy. The green text represents the user-provided text prompt, whose index is $k$ in (\ref{eq:sfxmask}).
  • Figure 4: Visual comparison between TFPOC and our model. TFPOC cannot handle extremely dense regions (white box).
  • Figure 5: MAE/MSE on the validation set during training.
  • ...and 1 more figure
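
The loop described in the Figure 2 caption (aggregate a token from the prompt mask, predict a density map, then feed that map back in as the next mask) can be sketched roughly as below. This is only an illustrative toy under stated assumptions, not the authors' implementation: `aggregate_token`, `predict_density`, and `fixed_point_inference` are hypothetical names, the cosine-similarity step is a stand-in for the paper's cross-attention and decoder, and the stopping rule (`max_iters`, `tol`) is assumed.

```python
import numpy as np


def aggregate_token(feats, mask, eps=1e-8):
    """Mask-weighted average of image features.

    feats: (H, W, C) feature map, mask: (H, W) non-negative weights -> (C,) token.
    """
    weights = mask / (mask.sum() + eps)
    return (feats * weights[..., None]).sum(axis=(0, 1))


def predict_density(feats, mask, eps=1e-8):
    """Toy stand-in (assumption) for the cross-attention + decoder stage.

    Scores each pixel by its cosine similarity to the aggregated token and
    clips negative scores to zero so the result can be reused as a mask.
    """
    token = aggregate_token(feats, mask)
    norms = np.linalg.norm(feats, axis=-1) * np.linalg.norm(token) + eps
    similarity = (feats @ token) / norms
    return np.clip(similarity, 0.0, None)


def fixed_point_inference(feats, prompt_mask, max_iters=10, tol=1e-4):
    """Feed the predicted density back in as the mask until it stops changing.

    Returns the final density map and the number of iterations performed.
    Assumes max_iters >= 1.
    """
    mask = prompt_mask.astype(float)
    for n_iters in range(1, max_iters + 1):
        density = predict_density(feats, mask)
        # A fixed point is reached when the output equals the input mask.
        if np.abs(density - mask).max() < tol:
            break
        mask = density
    return density, n_iters
```

In this sketch the predicted count is simply `density.sum()`. Starting from a mask that marks a single exemplar, the first pass spreads the response to all pixels with similar features, and later passes leave the map unchanged, which is the fixed-point behaviour the caption alludes to.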