CPR++: Object Localization via Single Coarse Point Supervision

Xuehui Yu; Pengfei Chen; Kuiran Wang; Xumeng Han; Guorong Li; Zhenjun Han; Qixiang Ye; Jianbin Jiao

TL;DR

This work tackles semantic variance in point-based object localization by introducing CPR, which refines coarse point annotations into semantic centers using MIL over neighbourhoods. Building on CPR, CPR++ adds a dynamic, cascade-based sampling regime and variance regularization to handle multi-scale objects, achieving state-of-the-art results across COCO, DOTA, SeaPerson, and VOC without requiring strict annotation rules. The combination of MIL-driven refinement, adaptive region estimation, and cascade optimization significantly reduces training ambiguity and improves localization accuracy, particularly for larger objects. Overall, CPR and CPR++ demonstrate that algorithmic refinement of weak supervision can rival or surpass center-keypoint annotations, broadening the practicality of POL in diverse real-world settings.

Abstract

Point-based object localization (POL), which pursues high-performance object sensing under low-cost data annotation, has attracted increased attention. However, the point annotation mode inevitably introduces semantic variance due to the inconsistency of annotated points. Existing POL methods rely heavily on strict annotation rules, which are difficult to define and apply, to handle this problem. In this study, we propose coarse point refinement (CPR), which to the best of our knowledge is the first attempt to alleviate semantic variance from an algorithmic perspective. CPR reduces the semantic variance by selecting a semantic centre point in a neighbourhood region to replace the initial annotated point. Furthermore, we design a sampling region estimation module to dynamically compute a sampling region for each object and use a cascaded structure to achieve end-to-end optimization. We further integrate a variance regularization into the structure to concentrate the predicted scores, yielding CPR++. We observe that CPR++ can obtain scale information and further reduce the semantic variance in a global region, thus guaranteeing high-performance object localization. Extensive experiments on four challenging datasets validate the effectiveness of both CPR and CPR++. We hope our work can inspire more research on designing algorithms rather than annotation rules to address the semantic variance problem in POL. The dataset and code will be made public at github.com/ucas-vg/PointTinyBenchmark.
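The refinement idea in the abstract (replace the annotated point with a semantic centre chosen from its neighbourhood) can be illustrated with a minimal sketch. This is not the authors' implementation: the concentric-circle sampler, the parameters `points_per_unit` and `rel_threshold`, and the Gaussian `toy_score` function are all assumptions made for illustration. In CPR the scores come from a network trained with MIL, which is replaced here by a hand-written stand-in.

```python
import numpy as np

def sample_neighbourhood(center, radius, points_per_unit=8):
    """Sample the centre plus points on circles of radius 1..radius around it.

    Circle r gets r * points_per_unit points (an assumed sampling density).
    """
    cx, cy = center
    pts = [(cx, cy)]
    for r in range(1, radius + 1):
        n = r * points_per_unit
        angles = 2.0 * np.pi * np.arange(n) / n
        pts.extend(zip(cx + r * np.cos(angles), cy + r * np.sin(angles)))
    return np.asarray(pts, dtype=float)

def refine_point(points, scores, rel_threshold=0.5):
    """Keep points scoring at least rel_threshold * max(scores) and
    return their score-weighted mean as the refined point."""
    scores = np.asarray(scores, dtype=float)
    keep = scores >= rel_threshold * scores.max()
    weights = scores[keep]
    return (points[keep] * weights[:, None]).sum(axis=0) / weights.sum()

def toy_score(points, true_center=(12.0, 9.0), sigma=2.0):
    """Hypothetical stand-in for a trained classifier: a Gaussian bump
    centred on the (normally unknown) semantic centre."""
    d2 = ((points - np.asarray(true_center)) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

annotated = (10.0, 10.0)                      # coarse annotation
bag = sample_neighbourhood(annotated, radius=5)
refined = refine_point(bag, toy_score(bag))   # moves towards (12, 9)
```

In this toy setting the refined point lands closer to the score peak than the coarse annotation does, which is the behaviour the abstract describes. The fixed `radius` is exactly the quantity that CPR++ is said to estimate per object instead.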

Paper Structure

This paper contains 49 sections, 18 equations, 14 figures, 11 tables, 4 algorithms.

Introduction
Related Work
Vision Tasks under Point Supervision
Vision Tasks with Multiple Instance Learning
Vision Tasks with Cascade Structure
Methodology
Overview
CPR
Point Sampling
CPR Training
CPR Inference
CPR++
Motivation
Sampling Region Estimation
Progressive Point Refinement
...and 34 more sections

Figures (14)

Figure 1: The motivation of CPR/CPR++. (a) Examples of coarse point annotation and the problem of semantic variance. (b) CPR aims to find a semantic center point for objects belonging to the same category to reduce training ambiguity. (c) Limitation of CPR. Due to the lack of scale information, both a small and a large sampling radius $r$ cause problems. A small radius yields a local semantic point instead of the global solution. A large radius results in merging with other object regions. (d) The performance of different sampling radii in CPR. As the radius grows, the ${\rm mAP}$ of large objects increases while that of small objects decreases. The performance drops when the radius is too large due to interference from adjacent objects.
Figure 2: Difficulty of key-point-based annotation. (a) Key points are hard to define due to the large intra-class variance of shape. (b) A key point (e.g., the head) may not exist due to multiple poses and views. (c) The key point's granularity (e.g., eye, forehead, head, and body) is hard to determine due to multiple scales.
Figure 3: Pipeline of coarse point-based localization (CPL). There are three steps: 1) Annotating objects as coarse points $A$; 2) Refining annotated points $A$ to semantic centers $\hat{A}$; 3) Training a localizer (e.g., P2PNet) with $\hat{A}$ as supervision. "×(K-1)" means the blue block repeats K-1 times; CPR++ reduces to CPR when K=1.
Figure 4: The framework of CPR and the sampling region estimation module of CPR++. Given feature map $F$, annotated point $A$ (green), center $\hat{A}$ (red) and radius $\hat{R}$ of point sampling, there are four steps in the CPR stage: 1) Positive bags (i.e., $B_1, B_2$) and negative samples (i.e., $Neg_{k_c}$) are obtained by point sampling, and then feature vectors of these points are extracted on $F$. 2) The network is trained with the feature vectors based on the CPR loss (MIL loss, annotation loss, and negative loss) and the variance loss. 3) Semantic points (red points on the birds) $B^+_1, B^+_2$ are selected by the classification scores of points in the bag (i.e., $B_1, B_2$) predicted by the trained network (CPR Inference). 4) Finally, the refined points (yellow) $\hat{A}'$ are obtained by weighted averaging of the semantic points. For CPR++, the circumscribed rectangle of the semantic points is used to estimate the radius of the next stage (Sec. \ref{sec: variance estimation}), and the variance map $G^{var}$ is fed into the network to conduct variance regularization (Sec. \ref{sec:variance-loss}) for the last CPR++ stage. (Best viewed in color.)
Figure 5: The framework of CPR++. (1) For an image, a backbone is used to extract the shared feature map for all the CPR heads. (2) Given the annotation point as a center point and an initial large radius for the sampling region, the positive point bag and negative samples are constructed to train the first CPR head. The semantic points estimated by the CPR head are used by the sampling region estimation module to obtain an adaptive, smaller sampling radius and the semantic center point. (3) The dynamic sampling radius and semantic center point are used to form the new positive point bag and negative samples to train the next CPR head. These procedures are repeated $K$ times. (4) For the last ($K$-th) CPR head, variance regularization is conducted with the supervision of the variance map $G^{var}$ generated by the previous stage to further reduce semantic variance. It is worth mentioning that the initial annotation points are used in all heads to calculate the annotation loss. CPR++ outputs $\hat{A}^{K+1}$ as the final refined point to supervise the localizer.
...and 9 more figures
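
The cascade described in the Figure 4 and Figure 5 captions (start with a large sampling radius, select semantic points, then derive a tighter centre and radius from their circumscribed rectangle for the next stage) can be sketched as follows. This is a hedged illustration under several assumptions, not the paper's method: the grid sampler `circle_points`, the half-diagonal radius rule in `estimate_region`, the threshold `thr`, and the box-shaped `toy_score` are all hypothetical, and the trained CPR heads, the losses, and the variance regularization are omitted entirely.

```python
import numpy as np

def circle_points(center, radius, step=1.0):
    """Grid points (spacing `step`) inside a disc; a simple stand-in sampler."""
    cx, cy = center
    ticks = np.arange(-radius, radius + step, step)
    dx, dy = np.meshgrid(ticks, ticks)
    inside = dx ** 2 + dy ** 2 <= radius ** 2
    return np.stack([cx + dx[inside], cy + dy[inside]], axis=1)

def estimate_region(semantic_points):
    """Centre and radius from the axis-aligned rectangle enclosing the
    semantic points. Using the half diagonal as radius is an assumption."""
    lo = semantic_points.min(axis=0)
    hi = semantic_points.max(axis=0)
    center = (lo + hi) / 2.0
    radius = float(np.linalg.norm(hi - lo) / 2.0)
    return center, radius

def cascade_refine(annotated, score_fn, init_radius=16.0, num_stages=3, thr=0.5):
    """Run `num_stages` rounds of sample -> select -> re-estimate region."""
    center, radius = np.asarray(annotated, dtype=float), init_radius
    for _ in range(num_stages):
        pts = circle_points(center, radius)
        semantic = pts[score_fn(pts) >= thr]
        if len(semantic) == 0:      # nothing selected: keep the last estimate
            break
        center, radius = estimate_region(semantic)
    return center, radius

def toy_score(points):
    """Hypothetical classifier: 1 inside a 20 x 8 box centred at (20, 16)."""
    x, y = points[:, 0], points[:, 1]
    return ((x >= 10) & (x <= 30) & (y >= 12) & (y <= 20)).astype(float)

annotated = (13.0, 14.0)            # coarse point near one end of the object
final_center, final_radius = cascade_refine(annotated, toy_score)
```

On this toy object the estimated centre drifts from the coarse annotation towards the middle of the box while the radius shrinks below its initial value, mirroring the "large radius first, adaptive radius later" behaviour the captions describe.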
