CNSv2: Probabilistic Correspondence Encoded Neural Image Servo

Anzhe Chen, Hongxiang Yu, Shuxin Li, Yuxi Chen, Zhongxiang Zhou, Wentao Sun, Rong Xiong, Yue Wang

TL;DR

CNSv2 tackles robustness gaps in visual servo caused by unreliable keypoint matching by introducing a neural policy conditioned on probabilistic, multimodal correspondence. It employs a translation-equivariant probabilistic matching representation derived from foundation-model features, paired with a Transformer-based controller and velocity denormalization to decouple training from real-world intrinsics and scene scales. Key contributions include resolution-agnostic anchoring of matching scores, a velocity denormalization strategy, and a hybrid control scheme that blends image-space and Cartesian trajectories, all demonstrated in simulation and real-world experiments with textureless and illumination-variant scenes. The approach achieves real-time performance and improved generalization to unseen scenes, offering a practical path toward robust visual servo in challenging environments.

Abstract

Visual servo based on traditional image matching methods often requires accurate keypoint correspondences for high-precision control. However, keypoint detection or matching tends to fail in challenging scenarios with inconsistent illumination or textureless objects, resulting in significant performance degradation. Previous approaches, including our previously proposed Correspondence encoded Neural image Servo policy (CNS), attempted to alleviate these issues by integrating neural control strategies. While CNS shows some improvement in robustness to erroneous correspondences over conventional image-based controllers, it could not fully resolve the limitations arising from poor keypoint detection and matching. In this paper, we continue to address this problem and propose a new solution: Probabilistic Correspondence Encoded Neural Image Servo (CNSv2). CNSv2 leverages probabilistic feature matching to improve robustness in challenging scenarios. By redesigning the architecture to condition on multimodal feature matching, CNSv2 achieves high precision and improved robustness across diverse scenes while running in real time. We validate CNSv2 in simulation and real-world experiments, demonstrating its effectiveness in overcoming the limitations of detector-based methods in visual servo tasks.
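
To make the idea of probabilistic feature matching more concrete, here is a minimal sketch. It assumes dense feature maps of shape (H, W, C) and turns cosine similarities into a per-location match distribution with a softmax. The function name `soft_correspondence`, the `temperature` parameter, and the random toy features are all hypothetical. This is a generic soft-matching illustration, not the representation or architecture used in the paper.

```python
import numpy as np

def soft_correspondence(feat_a, feat_b, temperature=0.1):
    """Return a row-stochastic matrix of match probabilities.

    Row i is a distribution over all locations of feat_b for location i of
    feat_a. Generic softmax matching, assumed here for illustration only.
    """
    a = feat_a.reshape(-1, feat_a.shape[-1])
    b = feat_b.reshape(-1, feat_b.shape[-1])
    # L2-normalize so the dot product is a cosine similarity.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    logits = (a @ b.T) / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy data: the "current" features are the "desired" features shifted one
# cell to the right (with wrap-around), standing in for a camera offset.
rng = np.random.default_rng(0)
feat_des = rng.normal(size=(4, 4, 16))
feat_cur = np.roll(feat_des, shift=1, axis=1)

probs = soft_correspondence(feat_cur, feat_des, temperature=0.05)
best = probs.argmax(axis=1)  # most likely desired-image cell per current cell
```

Unlike a hard keypoint matcher, each row of `probs` keeps the full distribution over candidate matches, so ambiguous or multimodal matches remain visible to a downstream controller instead of being collapsed to a single, possibly wrong, correspondence.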

Paper Structure

This paper contains 17 sections, 19 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We utilize probabilistic correspondence of robust features from a foundation model and use a neural policy for control, endowing image servo with generalization, high precision, and robustness to challenging scenes.
  • Figure 2: Overview of the probabilistic-matching-conditioned neural policy. We use foundation vision models to extract robust coarse features for matching. We build a resolution-agnostic and translation-equivariant representation of probabilistic matching, on which the neural controller is conditioned to predict the velocity control. Fine-grained features from a CNN are also fused to capture the pixel-wise error and improve servo precision.
  • Figure 3: Training pipeline. We use NVIDIA IsaacSim to render photo-realistic images. One simulation process collects data from a uniformly sampled space. Another simulation process collects data with DAgger.
  • Figure 4: Examples of rendered images in the simulation environment.
  • Figure 5: The translation-equivariant probabilistic matching representation enables faster training convergence.
  • ...and 2 more figures