Unveiling the Power of Self-supervision for Multi-view Multi-human Association and Tracking

Wei Feng, Feifan Wang, Ruize Han, Zekun Qian, Song Wang

TL;DR

This work exploits spatial-temporal self-consistency through three properties: reflexivity, symmetry, and transitivity. It designs self-supervised learning losses based on symmetry and transitivity to associate multiple humans over time and across views.

Abstract

Multi-view multi-human association and tracking (MvMHAT) is a new but important problem for multi-person scene video surveillance. It aims to track a group of people over time in each view and to identify the same person across different views at the same time, which differs from previous MOT and multi-camera MOT tasks that consider only over-time human tracking. As a result, videos for MvMHAT require more complex annotations but contain more information for self-learning. In this work, we tackle this problem with a self-supervised-learning-aware end-to-end network. Specifically, we take advantage of the spatial-temporal self-consistency rationale by considering three properties: reflexivity, symmetry, and transitivity. Besides the reflexivity property, which holds naturally, we design self-supervised learning losses based on symmetry and transitivity, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote research on MvMHAT, we build two new large-scale benchmarks for training the network and testing different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
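The symmetry and transitivity properties mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' loss formulation: the softmax-based `soft_assignment`, the `temperature` value, the helper name `cycle_residual`, and the toy one-hot features are all assumptions made for illustration, and the sketch assumes the same set of people is visible in every view. The idea it shows is that matching subjects from one view (or frame) to another and back again should return each subject to itself, so the product of the soft assignment matrices around a closed cycle should be close to the identity.

```python
import numpy as np


def soft_assignment(feats_a, feats_b, temperature=0.1):
    """Row-stochastic soft matching from subjects in A to subjects in B (illustrative)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def cycle_residual(*feature_sets):
    """Mean squared gap between a closed matching cycle and the identity.

    Two sets give a symmetry check (A -> B -> A); three sets give a
    transitivity check (A -> B -> C -> A).
    """
    n = len(feature_sets[0])
    hops = list(feature_sets) + [feature_sets[0]]  # close the loop back to the start
    cycle = np.eye(n)
    for src, dst in zip(hops[:-1], hops[1:]):
        cycle = cycle @ soft_assignment(src, dst)
    return float(np.mean((cycle - np.eye(n)) ** 2))


# Toy data (assumption): three people with one-hot "appearance" features,
# listed in a different order in each view.
view_a = np.eye(3)
view_b = view_a[[2, 0, 1]]
view_c = view_a[[1, 2, 0]]
uninformative = np.ones((3, 3))  # every subject looks the same

symmetric_ok = cycle_residual(view_a, view_b)
transitive_ok = cycle_residual(view_a, view_b, view_c)
symmetric_bad = cycle_residual(view_a, uninformative)
```

With discriminative features the two residuals are close to zero, while features that cannot tell subjects apart give a clearly larger residual. A quantity of this kind can be minimized without identity labels, which is the sense in which such consistency checks are self-supervised.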

Paper Structure

This paper contains 33 sections, 36 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: An illustration of the proposed MvMHAT problem.
  • Figure 2: Overall framework of the proposed method, with three frames taken as an example. In the training stage, an input batch consists of frames from different views and times together with their human detection results. The framework consists of two parts: an appearance feature learning module and an assignment matrix learning module. For each, we use the symmetric consistency and transitive consistency discussed in Section \ref{sec:idea} to construct the self-supervised training loss.
  • Figure 3: An illustration of symmetric and transitive consistency rationale.
  • Figure 4: An illustration of the structure of STAN.
  • Figure 5: An illustration of the qualitative results. Figure (a) shows the tracking results of the MOT method Tracktor++ for one view, while figure (b) shows the association and tracking results of our method for multiple views. Figure (c) shows the results of our method on a group of multi-view videos in the MvMHAT dataset. Each result shows four sampled frames; frames in the same column of a sub-figure are captured at the same time.