Counting Through Occlusion: Framework for Open World Amodal Counting

Safaeid Hossain Arib, Rabeya Akter, Abdul Monaf Chowdhury, Md Jubair Ahmed Sourov, Md Mehedi Hasan

TL;DR

CountOCC addresses the core challenge of counting objects under occlusion in open-world settings by introducing a hierarchical Feature Reconstruction Module (FRM) and Visual Equivalence (VisEQ) supervision. FRM explicitly reconstructs occluded object features across pyramid levels using spatial context and semantic text-visual priors, while VisEQ enforces gradient-based attention consistency between occluded and unoccluded views. The method achieves state-of-the-art results on occlusion-augmented benchmarks FSC-147-OCC and CARPK-OCC and demonstrates strong cross-domain performance on CAPTURe-Real, validating robust amodal counting across varied visual domains. A rigorous evaluation framework and ablations reveal that explicit feature reconstruction combined with attention-level supervision is essential for reliable amodal counting in real-world cluttered environments.

Abstract

Object counting has achieved remarkable success on visible instances, yet state-of-the-art (SOTA) methods fail under occlusion, a pervasive challenge in real-world deployment. This failure stems from a fundamental architectural limitation: backbone networks encode occluding surfaces rather than target objects, thereby corrupting the feature representations required for accurate enumeration. To address this, we present CountOCC, an amodal counting framework that explicitly reconstructs occluded object features through hierarchical multimodal guidance. Rather than accepting degraded encodings, we synthesize complete representations by integrating spatial context from visible fragments with semantic priors from text and visual embeddings, generating class-discriminative features at occluded locations across multiple pyramid levels. We further introduce a visual equivalence objective that enforces consistency in attention space, ensuring that occluded and unoccluded views of the same scene produce spatially aligned gradient-based attention maps. Together, these complementary mechanisms preserve the discriminative properties essential for accurate counting under occlusion. For rigorous evaluation, we establish occlusion-augmented versions of FSC-147 and CARPK spanning both structured and unstructured scenes. CountOCC achieves SOTA performance on FSC-147 with 26.72% and 20.80% MAE reduction over prior baselines under occlusion in validation and test, respectively. CountOCC also demonstrates exceptional generalization by setting new SOTA results on CARPK with 49.89% MAE reduction and on CAPTURe-Real with 28.79% MAE reduction, validating robust amodal counting across diverse visual domains. Code will be released soon.
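The visual equivalence objective can be illustrated with a small sketch. The Figure 4 caption states that the attention maps $\mathbf{G}_T$ and $\mathbf{G}_S$ are aligned through $\ell_2$ and cosine metrics, but the exact formulation is not given here, so the function below is an assumption: it sums a mean squared error term and a one-minus-cosine term with hypothetical weights `w_l2` and `w_cos`. The function name and the toy attention maps are illustrative, not taken from the paper.

```python
import numpy as np

def attention_similarity_loss(g_teacher, g_student, w_l2=1.0, w_cos=1.0, eps=1e-8):
    """Hypothetical stand-in for L_sim: weighted MSE plus (1 - cosine similarity).

    The weights and the way the two terms are combined are assumptions.
    """
    t = np.asarray(g_teacher, dtype=np.float64).ravel()
    s = np.asarray(g_student, dtype=np.float64).ravel()
    if t.shape != s.shape:
        raise ValueError("attention maps must have the same number of elements")
    # l2 term: mean squared difference between the two flattened maps
    l2_term = np.mean((t - s) ** 2)
    # cosine term: 0 when the maps point in the same direction
    cosine = np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s) + eps)
    cos_term = 1.0 - cosine
    return float(w_l2 * l2_term + w_cos * cos_term)

# Toy example: the student map loses attention inside a simulated occluded patch.
rng = np.random.default_rng(0)
g_t = rng.random((8, 8))      # teacher map from the unoccluded view
g_s = g_t.copy()
g_s[2:5, 2:5] = 0.0           # student map with attention suppressed in the patch

loss_same = attention_similarity_loss(g_t, g_t)
loss_occluded = attention_similarity_loss(g_t, g_s)
print(f"identical maps: {loss_same:.6f}, occluded student: {loss_occluded:.6f}")
```

In this sketch the loss is close to zero when both views yield the same map and grows as the student's attention drifts inside the occluded patch, which is the behaviour the consistency objective is meant to penalise.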

Paper Structure

This paper contains 27 sections, 17 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: The occlusion challenge in open-world amodal object counting. (a) Unoccluded scene with all instances visible. (b) The same scene with an occluder masking a subset of instances. (c) State-of-the-art methods fail to infer hidden instances, counting only the visible objects. (d) Our method, CountOCC, accurately performs amodal counting, correctly predicting the total count by reasoning about both visible and occluded instances.
  • Figure 2: The CountOCC architecture. Our framework integrates two complementary supervision mechanisms for robust amodal counting. FRM operates at each pyramid level to generate reconstructed features $\hat{\mathbf{Z}}_{occ}$ that replace corrupted occluded tokens. VisEQ enforces attention consistency by aligning gradient-based attention maps $\mathbf{G}_T$ and $\mathbf{G}_S$ from teacher and student networks across occluded and unoccluded views. Reconstructed features $\hat{\mathbf{Z}}_{occ}$ flow through feature enhancer $f_{\varphi}$ and cross-modality decoder $f_{\psi}$, producing density predictions $\textit{Count}_{vis}$ and $\textit{Count}_{occ}$ that aggregate to total count $\textit{Count}_{total}$.
  • Figure 3: Architecture of the Feature Reconstruction Module. FRM reconstructs occluded features through hierarchical attention fusion. Learnable queries $\mathbf{Q}_0$ initialized from occluded positions undergo self-attention to model inter-dependencies, then cross-attend to visible tokens $\mathbf{Z}_{vis}$ to aggregate spatial context, producing spatially-informed queries $\mathbf{Q}_{vis}$. These queries are further refined through cross-attention with fused text-visual embeddings $\mathbf{Z}_{v,t}$ to inject semantic guidance, producing conditioned features $\mathbf{Z}_{cond}$ that an MLP transforms into class-discriminative reconstructed features $\hat{\mathbf{Z}}_S$ for occluded regions.
  • Figure 4: Overview of the Visual Equivalence supervision framework. VisEQ enforces attention consistency across occluded and unoccluded views through dual supervision. Teacher network $f_T$ processes original image $\mathbf{X}_I$ to generate attention map $\mathbf{G}_T$, while student network $f_s$ processes occluded image $\mathbf{X}_{occ}$ with reconstructed tokens $\tilde{\mathbf{Z}}$ to produce $\mathbf{G}_S$. Both leverage fused text-visual tokens $\mathbf{Z}_{v,t}$ for class-specific guidance. Attention similarity loss $\mathcal{L}_{sim}$ aligns $\mathbf{G}_T$ and $\mathbf{G}_S$ through $\ell_2$ and cosine metrics, while ROI consistency loss $\mathcal{L}_{cst}$ encourages high activation and low variance in confident regions, ensuring spatially consistent localization regardless of occlusion state.
  • Figure 5: Visualization of reconstructed features across network depths. Left column shows occluded input images. Remaining columns display t-SNE embeddings at three pyramid levels (256, 512, 1024 channels). We compare occluded features (red), ground truth features from unoccluded images (green), and reconstructed features (blue).
  • ...and 6 more figures
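
The attention flow in the Figure 3 caption (self-attention over queries, cross-attention to visible tokens, cross-attention to fused text-visual embeddings, then an MLP) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head with no learned projections, the residual connections are assumed, the MLP is a two-layer ReLU network with random weights, and all names (`attend`, `frm_forward`, the token counts and width) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention with an assumed residual connection."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return queries + weights @ keys_values

def frm_forward(q0, z_vis, z_vt, w1, b1, w2, b2):
    """One pyramid level of the sketched reconstruction flow."""
    q_self = attend(q0, q0)         # self-attention among occluded-position queries
    q_vis = attend(q_self, z_vis)   # cross-attention to visible tokens (spatial context)
    z_cond = attend(q_vis, z_vt)    # cross-attention to fused text-visual embeddings
    hidden = np.maximum(z_cond @ w1 + b1, 0.0)   # MLP with ReLU
    return hidden @ w2 + b2         # reconstructed features for occluded tokens

# Toy sizes (assumed): 5 occluded tokens, 20 visible tokens, 4 text-visual tokens.
rng = np.random.default_rng(0)
d, n_occ, n_vis, n_vt = 16, 5, 20, 4
q0 = rng.standard_normal((n_occ, d))
z_vis = rng.standard_normal((n_vis, d))
z_vt = rng.standard_normal((n_vt, d))
w1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
w2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)

z_hat = frm_forward(q0, z_vis, z_vt, w1, b1, w2, b2)

# Replace the corrupted occluded tokens in the full token sequence.
tokens = rng.standard_normal((n_occ + n_vis, d))
occ_mask = np.zeros(n_occ + n_vis, dtype=bool)
occ_mask[:n_occ] = True
tokens[occ_mask] = z_hat
print(z_hat.shape)
```

The last step mirrors the Figure 2 description, where reconstructed features replace the corrupted occluded tokens before the sequence is passed on; in the full model this would be repeated at each pyramid level with that level's channel width.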