Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes

Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy

TL;DR

Self-MVA presents a self-supervised, uncalibrated framework for robust cross-view person association in challenging scenes lacking identity labels or camera calibration. It combines an encoder-decoder that learns unified geometric-appearance embeddings, a cross-view synchronization pretext task, and two self-supervised linear constraints—multi-view re-projection and pairwise edge association—to reduce the solution space. The method achieves state-of-the-art results on WILDTRACK, MVOR, and SOLDIERS, and enables automatic identity annotation, improved multi-object tracking, and potential pose estimation workflows without annotations. This approach broadens the applicability of multi-view analysis to real-world, cluttered environments where appearance cues are insufficient and cameras are uncalibrated.

Abstract

Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although person re-identification features have proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization and propose Self-MVA, a self-supervised uncalibrated multi-view person association approach that uses no annotations. Specifically, we propose a self-supervised learning framework consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views were captured at the same time. The model encodes each person's unified geometric and appearance features, and we train it using synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully supervised approaches. Code is available at https://github.com/CAMMA-public/Self-MVA.
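
The step from instance-wise to image-wise distances can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy 2D embeddings, the mean-of-matched-distances aggregation, and the margin-based triplet loss are all assumptions made for illustration. To stay dependency-free, an exhaustive search over assignments stands in for the Hungarian algorithm; it returns the same optimal matching but is only practical for a handful of persons per image.

```python
import math
from itertools import permutations

def pairwise_dist(a, b):
    """Euclidean distance between every embedding in a and every embedding in b."""
    return [[math.dist(p, q) for q in b] for p in a]

def min_cost_matching(cost):
    """Optimal one-to-one assignment found by exhaustive search.

    Stand-in for Hungarian matching (assumption: small inputs only).
    Returns (list of (row, col) pairs, total cost).
    """
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows > n_cols:
        # Transpose so that the smaller side is always enumerated as rows.
        transposed = [list(col) for col in zip(*cost)]
        pairs, total = min_cost_matching(transposed)
        return [(r, c) for c, r in pairs], total
    best_pairs, best_total = [], math.inf
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best_pairs, best_total = list(enumerate(cols)), total
    return best_pairs, best_total

def image_distance(a, b):
    """Image-wise distance: mean instance-wise distance over the optimal matching.

    Assumes both images contain at least one detected person.
    """
    pairs, total = min_cost_matching(pairwise_dist(a, b))
    return total / len(pairs)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing synchronized pairs closer than non-synchronized ones."""
    return max(0.0, image_distance(anchor, positive)
               - image_distance(anchor, negative) + margin)

# Toy per-person embeddings (hypothetical values).
anchor = [(0.0, 0.0), (5.0, 5.0)]        # view A at time t
positive = [(5.1, 5.0), (0.1, 0.0)]      # view B at time t (synchronized)
negative = [(9.0, 9.0), (3.0, 0.0)]      # view B at another time

pairs, _ = min_cost_matching(pairwise_dist(anchor, positive))
print(sorted(pairs))
print(triplet_loss(anchor, positive, negative))
```

In this sketch the only supervision is which image pair is synchronized; the person-level correspondences come out of the matching itself, which is the sense in which the synchronization label supervises association.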

Paper Structure

This paper contains 33 sections, 15 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Examples of different multi-view person association methods on the challenging WILDTRACK [chavdarova2018wildtrack], MVOR [srivastav2018mvor] and SOLDIERS [soldiers] datasets. ViT-P3DE [seo2023vit] is fully supervised, while Person Re-ID [zhou2021learning] and Self-MVA are unsupervised. Red and green identities indicate incorrect and correct associations, respectively. Best viewed in color.
  • Figure 2: Examples of synchronized and non-synchronized image pairs in the WILDTRACK dataset [chavdarova2018wildtrack].
  • Figure 3: Overview of the self-supervised learning framework (FC = Fully-Connected, LN = Layer Normalization). For each image with detected persons, we encode each person's appearance features using a person Re-ID model, map their geometric information to a unified geometric feature space using positional encodings and learnable camera embeddings, and then decode the original 2D position. Then, we construct a triplet for each anchor image by randomly selecting a negative sample and conduct metric learning after instance association and edge association. Best viewed in color.
  • Figure 4: Example of multi-view person association on the Shelf [belagiannis20143d] and Panoptic [joo2015panoptic] datasets.
  • Figure 5: Examples of failure cases: (a) the person at the edge of the image is incorrectly associated with the person standing next to them; (b) the two overlapping persons are associated with each other; (c) the two persons that each appear in only one view are incorrectly associated.
  • ...and 5 more figures