Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction

Shannan Yan, Leqi Zheng, Keyu Lv, Jingchen Ni, Hongyang Wei, Jiajun Zhang, Guangting Wang, Jing Lyu, Chun Yuan, Fengyun Rao

TL;DR

A simple yet effective framework based on conditional binary segmentation: an object query mask is encoded into a latent representation that guides the localization of the corresponding object in a target video, and a cycle-consistency training objective encourages robust, view-invariant representations.

Abstract

We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
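
The cycle-consistency objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two predictor callables, the flattened-list mask format, and the choice of binary cross-entropy as the reconstruction loss are all assumptions made for illustration. The point is only that the loss compares the reconstructed source mask with the original query mask, so no target-view annotation is involved.

```python
import math
from typing import Callable, List

# Assumption: a mask is a flattened list of per-pixel values in [0, 1].
Mask = List[float]
# Assumption: a predictor maps a mask in one view to a soft mask in the other.
Predictor = Callable[[Mask], Mask]


def bce(pred: Mask, target: Mask, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a soft prediction and a target mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(target)


def cycle_consistency_loss(
    query_mask: Mask,
    predict_src_to_tgt: Predictor,  # hypothetical source-to-target model
    predict_tgt_to_src: Predictor,  # hypothetical target-to-source model
) -> float:
    """Transfer the query mask to the target view, project it back,
    and score the reconstruction against the original query mask."""
    target_pred = predict_src_to_tgt(query_mask)
    reconstructed = predict_tgt_to_src(target_pred)
    return bce(reconstructed, query_mask)


if __name__ == "__main__":
    query = [0.0, 1.0, 1.0, 0.0]
    identity = lambda m: list(m)            # toy stand-in for a perfect model
    invert = lambda m: [1.0 - x for x in m]  # toy stand-in for a failing model
    print(cycle_consistency_loss(query, identity, identity))
    print(cycle_consistency_loss(query, identity, invert))
```

Because this loss depends only on the query mask and the model's own predictions, the same quantity could in principle be minimized on unlabeled test pairs, which is the property the abstract credits for enabling test-time training.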

Paper Structure (52 sections, 6 equations, 10 figures, 14 tables)

This paper contains 52 sections, 6 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Cycle-Consistent Visual Correspondence with Test-Time Training. Our framework learns object-level correspondences by enforcing cycle consistency: the object mask is transferred from source to target view and projected back to reconstruct the original query. This self-supervised constraint enables robust cross-view alignment and supports test-time training to further improve performance during inference.
  • Figure 2: Model overview. $\mathit{CLS}$ denotes class tokens, and $\mathit{CDT}$ denotes condition tokens. The CLS head predicts whether the object indicated by the source-image mask is visible in the target image. The bottom-left image shows the source image with the object mask, while the bottom-right image shows the target image.
  • Figure 3: Visualization illustrating the contribution of test-time training.
  • Figure 4: (a) Performance per activity scenario; (b) Performance across different object sizes in the target view.
  • Figure 5: Qualitative results on the Ego-Exo4D correspondence benchmark. Each row corresponds to one sample. From top to bottom, the first and second rows show samples of Ego2Exo, while the third and fourth rows show samples of Exo2Ego.
  • ...and 5 more figures