Independently Keypoint Learning for Small Object Semantic Correspondence

Hailong Jin, Huiying Li

TL;DR

This work formalizes Small Object Semantic Correspondence (SOSC) and proposes KBCNet, a framework that combines a Cross-Scale Feature Alignment (CSFA) module with an efficient center-pivot 4D convolutional decoder to produce robust semantic matches. A key contribution is the Keypoint Bounding box-centered Cropping (KBC) method, an inference-time preprocessing step that crops and enlarges the region around a small object's keypoints to reduce feature fusion and enable independent keypoint learning; it also works as a plug-in for other methods. Experiments on PF-PASCAL, PF-WILLOW, and SPair-71k show state-of-the-art performance, with a notable gain of 7.5% (absolute) on SPair-71k, and ablations demonstrate the effectiveness of CSFA, KBC, and the center-pivot 4D decoder in improving both accuracy and efficiency.

Abstract

Semantic correspondence, the task of establishing correspondences between a pair of images of the same category or of similar scenes, remains challenging due to large intra-class appearance variation. In this paper, we introduce a novel problem called 'Small Object Semantic Correspondence (SOSC)'. This problem is challenging because the keypoints of small objects lie close together, which causes their features to fuse. Keypoints with fused features are difficult to recognize, and their correspondences are difficult to identify. To address this challenge, we propose the Keypoint Bounding box-centered Cropping (KBC) method, which increases the spatial separation between the keypoints of small objects and thereby facilitates independent learning of these keypoints. The KBC method is integrated into our proposed inference pipeline and can be easily incorporated into other methods, yielding significant performance improvements. Additionally, we introduce a novel framework, named KBCNet, which serves as our baseline model. KBCNet comprises a Cross-Scale Feature Alignment (CSFA) module and an efficient 4D convolutional decoder. The CSFA module aligns multi-scale features, enriching keypoint representations by integrating fine-grained features with deep semantic features. The decoder, built on efficient 4D convolution, ensures efficiency and rapid convergence. To validate the effectiveness of our method, we conduct extensive experiments on three widely used benchmarks: PF-PASCAL, PF-WILLOW, and SPair-71k. Our KBC method yields a substantial performance improvement of 7.5% on the SPair-71k dataset, providing compelling evidence of its efficacy.
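To make the cropping idea concrete, here is a minimal sketch of a keypoint-bounding-box-centered crop. It is not the authors' implementation: the function name `kbc_crop`, the square crop, the `margin` and `out_size` values, and the nearest-neighbour enlargement are all assumptions chosen for illustration. The sketch only shows why enlarging the crop moves nearby keypoints into different 16x16 cells.

```python
import numpy as np

CELL = 16  # window size used to partition the image, as in the Figure 1 caption


def kbc_crop(image, keypoints, margin=0.2, out_size=256):
    """Crop a square around the keypoint bounding box and enlarge it.

    image: (H, W, C) array; keypoints: (N, 2) array of (x, y) pixel coordinates.
    margin and out_size are hypothetical values, not taken from the paper.
    Returns the enlarged crop and the keypoints in crop coordinates.
    """
    h, w = image.shape[:2]
    kps = np.asarray(keypoints, dtype=np.float64)
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)

    # Square box centered on the keypoint bounding box, padded by `margin`.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    side = max(x1 - x0, y1 - y0, 1.0) * (1 + 2 * margin)
    left = int(np.clip(np.floor(cx - side / 2), 0, w - 1))
    top = int(np.clip(np.floor(cy - side / 2), 0, h - 1))
    right = int(np.clip(np.ceil(cx + side / 2), left + 1, w))
    bottom = int(np.clip(np.ceil(cy + side / 2), top + 1, h))
    crop = image[top:bottom, left:right]

    # Nearest-neighbour enlargement (assumption: any resize method would do).
    ch, cw = crop.shape[:2]
    rows = np.arange(out_size) * ch // out_size
    cols = np.arange(out_size) * cw // out_size
    resized = crop[rows][:, cols]

    # Map the keypoints into the enlarged crop's coordinate frame.
    new_kps = (kps - np.array([left, top])) * out_size / np.array([cw, ch])
    return resized, new_kps


image = np.zeros((512, 512, 3), dtype=np.uint8)
kps = np.array([[100, 100], [108, 104], [104, 110]], dtype=np.float64)

resized, new_kps = kbc_crop(image, kps)
cells_before = {tuple(c) for c in (kps // CELL).astype(int)}
cells_after = {tuple(c) for c in (new_kps // CELL).astype(int)}

print(len(cells_before))  # all three keypoints share one 16x16 cell
print(len(cells_after))   # after cropping, each keypoint has its own cell
```

In a full pipeline, predictions made on the enlarged crop would be mapped back to the original image by inverting the scale and offset computed above.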

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The visualization of input images for inference. The red points are the keypoints of the object, which can serve as either source or predicted target keypoints. The image is segmented into small regions using a $16\times 16$ window, typically aligned with the downsampling factor.
  • Figure 2: The overall architecture of the proposed framework. The framework comprises a convolutional backbone, a cross-scale feature alignment (CSFA) module, and an efficient 4D convolutional decoder. In the CSFA module, we use the second scale features as the query to integrate the other two scale features, respectively, and combine bilinear interpolation-aligned features. Subsequently, the aligned features are employed to compute 4D correlation maps with a similarity function. Finally, the obtained 4D correlation maps are input into the 4D convolutional decoder, which serves to adjust local matches.
  • Figure 3: Visualization of the improvements that our KBC method brings to SCOT, DHPF, CHM, CATs, MMNet-FCN, and our KBCNet on the SPair-71k dataset.
  • Figure 4: Small object visual correspondence generated by state-of-the-art methods, including CHM, CATs, MMNet-FCN and our proposed method. The first two rows of images represent the original matching results of these methods, and the last two rows represent the matching results using the KBC method.
  • Figure 5: More matching visualization results for our KBCNet. The red star indicates the location of the source keypoint, the blue dot indicates the predicted target keypoint location, and the green "x" indicates the location of the ground truth target keypoint.
  • ...and 1 more figure
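The Figure 2 caption mentions computing 4D correlation maps from the aligned features with a similarity function. The sketch below illustrates that step under stated assumptions: cosine similarity stands in for the unspecified similarity function, and the names `correlation_4d` and `best_match` are hypothetical. It does not model the CSFA module or the 4D convolutional decoder.

```python
import numpy as np


def correlation_4d(src_feat, tgt_feat, eps=1e-8):
    """Cosine similarity between every source and target location.

    src_feat: (C, Hs, Ws), tgt_feat: (C, Ht, Wt).
    Returns an array of shape (Hs, Ws, Ht, Wt).
    Cosine similarity is an assumption; the source only says
    "a similarity function".
    """
    # L2-normalise each location's feature vector along the channel axis.
    src = src_feat / (np.linalg.norm(src_feat, axis=0, keepdims=True) + eps)
    tgt = tgt_feat / (np.linalg.norm(tgt_feat, axis=0, keepdims=True) + eps)
    # Dot product over channels for every (source, target) location pair.
    return np.einsum("cij,ckl->ijkl", src, tgt)


def best_match(corr, i, j):
    """Target location with the highest similarity to source location (i, j)."""
    flat_index = np.argmax(corr[i, j])
    return np.unravel_index(flat_index, corr.shape[2:])


rng = np.random.default_rng(0)
src_feat = rng.standard_normal((8, 4, 4))
tgt_feat = src_feat.copy()  # identical features, so each location matches itself

corr = correlation_4d(src_feat, tgt_feat)
match = best_match(corr, 2, 3)
print(corr.shape)
print(match)
```

With identical source and target features, each source location is most similar to the same location in the target. A decoder such as the one in the caption would then refine these raw matches.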