TopicFM+: Boosting Accuracy and Efficiency of Topic-Assisted Feature Matching

Khang Truong Giang, Soohwan Song, Sungho Jo

TL;DR

This study proposes a novel image-matching method that leverages a topic-modeling strategy to capture high-level contexts in images and achieves an approximately 50% reduction in computational costs compared to other Transformer-based methods.

Abstract

This study tackles the challenge of image matching in difficult scenarios, such as scenes with significant variations or limited texture, with a strong emphasis on computational efficiency. Previous studies have attempted to address this challenge by encoding global scene contexts using Transformers. However, these approaches suffer from high computational costs and may not capture sufficient high-level contextual information, such as structural shapes or semantic instances. Consequently, the encoded features may lack discriminative power in challenging scenes. To overcome these limitations, we propose a novel image-matching method that leverages a topic-modeling strategy to capture high-level contexts in images. Our method represents each image as a multinomial distribution over topics, where each topic represents a latent semantic instance. By incorporating these topics, we can effectively capture comprehensive context information and obtain discriminative and high-quality features. Additionally, our method effectively matches features within corresponding semantic regions by estimating the covisible topics. To enhance the efficiency of feature matching, we have designed a network with a pooling-and-merging attention module. This module reduces computation by employing attention only on fixed-sized topics and small-sized features. Through extensive experiments, we have demonstrated the superiority of our method in challenging scenarios. Specifically, our method significantly reduces computational costs while maintaining higher image-matching accuracy compared to state-of-the-art methods. The code will be updated soon at https://github.com/TruongKhang/TopicFM
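The two core ideas in the abstract, describing each image as a distribution over topics and restricting matching to covisible topics, can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the softmax over feature-topic dot products, the averaging of per-feature distributions into an image-level one, the product-based covisibility score, and all function names and toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_distributions(features, topics):
    """Per-feature distribution over topics.

    Assumption: the score of a feature for a topic is their dot product.
    """
    return [
        softmax([sum(f_i * t_i for f_i, t_i in zip(f, t)) for t in topics])
        for f in features
    ]


def image_topic_distribution(per_feature):
    """Image-level distribution: the mean of the per-feature distributions (assumption)."""
    n = len(per_feature)
    k = len(per_feature[0])
    return [sum(p[j] for p in per_feature) / n for j in range(k)]


def covisible_topics(dist_a, dist_b, top_k):
    """Indices of the top_k topics that are likely in both images.

    Assumption: covisibility is scored by the product of the two
    image-level topic probabilities.
    """
    scores = [a * b for a, b in zip(dist_a, dist_b)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


# Toy data: three hypothetical topic embeddings and two tiny "images".
topics = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features_a = [[4.0, 0.0], [0.0, 4.0]]  # image A contains topics 0 and 1
features_b = [[0.0, 4.0], [0.0, 5.0]]  # image B contains only topic 1

dist_a = image_topic_distribution(topic_distributions(features_a, topics))
dist_b = image_topic_distribution(topic_distributions(features_b, topics))
shared = covisible_topics(dist_a, dist_b, top_k=1)
print(shared)
```

In this toy case only topic 1 is prominent in both images, so it is the one selected. A matcher built along these lines would then compare only the features assigned to that topic rather than every pair of features.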

Paper Structure (25 sections, 24 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: The main concept of our human-friendly topic-assisted feature matching, TopicFM. The method pools high-level context information from each image into a set of topics and then estimates a distribution over topics for each image feature. Consequently, the image is represented probabilistically with the topics marked in distinct colors. With this representation, TopicFM can efficiently identify overlapping regions by selecting covisible topics. By leveraging the selected topics and the unique information captured by each topic, TopicFM enhances the robustness of features, thus improving the matching accuracy.
  • Figure 2: Comparison between the proposed models (TopicFM-fast, TopicFM+) and SOTA Transformer-based methods on the MegaDepth dataset \cite{li2018megadepth}. We report accuracy (AUC@$10^\circ$), runtime (ms), and computational cost (GFLOPs). The runtime and computational cost are measured at the image resolution of $1216 \times 1216$.
  • Figure 3: The architecture of the proposed image-matching method. We design a Feature Pyramid Network (FPN) that extracts feature maps at the low $\left(\frac{H}{8} \times \frac{W}{8}\right)$ and high $\left(\frac{H}{2} \times \frac{W}{2}\right)$ resolutions. Building upon this, we introduce a topic-assisted feature matching module (Section \ref{method_topicfm}) and a dynamic refinement network (Section \ref{method_dynamic_refinement}) to perform coarse-level and fine-level matching, respectively.
  • Figure 4: Details of the MLP-Mixer block. This block extracts new features by employing MLP layers on both the spatial side (token-mixing) and the channel side (channel-mixing) independently.
  • Figure 5: Visualization of dynamic refinement results. The keypoints estimated in the coarse stage are highlighted in red, while the refined keypoints from the fine stage are depicted in green. The visualized patches clearly illustrate the transformation of coarse keypoints from flat regions to peak points, demonstrating their effectiveness in enhancing the matching accuracy.
  • ...and 4 more figures
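
The token-mixing and channel-mixing steps named in the Figure 4 caption can be sketched in a few lines. This is a simplified illustration of a generic MLP-Mixer block, not the paper's implementation: it assumes two-layer MLPs with ReLU, omits biases and layer normalization, adds residual connections, and uses hypothetical function names and toy weights.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x (biases omitted for brevity)."""
    return matmul(relu(matmul(x, w1)), w2)


def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """x has shape (tokens, channels).

    Token mixing works on the transpose, so each row holds one channel
    across all spatial positions. Channel mixing works on x directly, so
    each row holds all channels of one position. Both use a residual add.
    """
    x = add(x, transpose(mlp(transpose(x), token_w1, token_w2)))
    return add(x, mlp(x, chan_w1, chan_w2))


# Toy input: 3 tokens (spatial positions) with 2 channels each.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
swap = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # exchanges tokens 0 and 2
eye3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero2 = [[0.0, 0.0], [0.0, 0.0]]  # disables channel mixing in this demo

out = mixer_block(x, swap, eye3, zero2, zero2)
print(out)
```

With the token-mixing weights set to a permutation and channel mixing disabled, each output row is the sum of a token and its mirrored counterpart. This shows that the first MLP moves information across spatial positions while leaving the channels separate.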