
SeMoLi: What Moves Together Belongs Together

Jenny Seidenschwarz, Aljoša Ošep, Francesco Ferroni, Simon Lucey, Laura Leal-Taixé

TL;DR

This work tackles semi-supervised object detection based on motion cues. It outperforms prior heuristic-based approaches and shows that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner.

Abstract

We tackle semi-supervised object detection based on motion cues. Recent results suggest that heuristic-based clustering methods in conjunction with object trackers can be used to pseudo-label instances of moving objects and use these as supervisory signals to train 3D object detectors in Lidar data without manual supervision. We re-think this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner. We leverage recent advances in scene flow estimation to obtain point trajectories from which we extract long-term, class-agnostic motion patterns. Revisiting correlation clustering in the context of message passing networks, we learn to group those motion patterns to cluster points into object instances. By estimating the full extent of the objects, we obtain per-scan 3D bounding boxes that we use to supervise a Lidar object detection network. Our method not only outperforms prior heuristic-based approaches (57.5 AP, +14 improvement over prior work); more importantly, we show that we can pseudo-label and train object detectors across datasets.
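As a rough illustration of how point trajectories can be turned into motion patterns, the sketch below computes per-step velocities by finite differences. It is a minimal sketch under stated assumptions, not the authors' feature extractor: the function name `velocity_features`, the `(N, T, 3)` trajectory layout, and the default time step `dt = 0.1` s are all hypothetical choices made for this example.

```python
import numpy as np


def velocity_features(trajectories: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """Turn per-point trajectories into flattened velocity features.

    trajectories: array of shape (N, T, 3) holding the 3D position of each
                  of N points at T consecutive time steps (assumed layout).
    dt:           time between consecutive steps in seconds (assumed value).

    Returns an array of shape (N, 3 * (T - 1)) with the per-step velocity
    vectors of every point, concatenated along the last axis.
    """
    if trajectories.ndim != 3 or trajectories.shape[2] != 3:
        raise ValueError("trajectories must have shape (N, T, 3)")
    if trajectories.shape[1] < 2:
        raise ValueError("at least two time steps are needed")

    # Finite differences between consecutive positions, divided by the
    # time step, give one velocity vector per point and per step.
    velocities = np.diff(trajectories, axis=1) / dt  # (N, T - 1, 3)

    # Flatten the time axis so each point has a single feature vector.
    return velocities.reshape(trajectories.shape[0], -1)
```

Points on the same rigid object should produce similar feature vectors under this construction, which is the property a learned grouping step can exploit.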

Paper Structure (26 sections, 5 equations, 6 figures, 12 tables)


Figures (6)

  • Figure 1: Towards learning to pseudo-label: We propose SeMoLi, a data-driven approach for segmenting moving instances in point clouds (top), which we use to learn to detect moving objects (O) in Lidar. We visually contrast SeMoLi with prior art that tackles a similar problem via density-based clustering (DBSCAN) [najibi2022motion]. We visualize the whole point cloud in purple, and the dynamic points, used as input to our method and the baseline to localize moving instances, in green. We color-code individual segmented instances. From left to right, SeMoLi (i) segments objects even in sparse point clouds and suffers less from under-segmentation, (ii) is able to learn to filter noise from the filtered point cloud, (iii) leads to less over-segmentation, and (iv) generalizes better to different classes. Best seen in color, zoomed.
  • Figure 2: Segment Moving in Lidar for Pseudo-Labeling: We first preprocess the point cloud to remove static points and predict per-point trajectories on the filtered point cloud (preprocessing and trajectory prediction). Then, we extract velocity-based features from the trajectories and learn to cluster, i.e., segment points based on motion patterns, using a Message-Passing Network [gilmer2017MPN] in a fully data-driven manner (SeMoLi). From point segments, we extract bounding boxes and inflate them (extracting and inflating bounding boxes). Finally, we apply our approach to unlabeled Lidar streams to obtain pseudo-labels that we use to train object detectors.
  • Figure 3: Train and validation splits: We conduct our experiments using the Waymo training set, for which manual labels are available. We fix two separate validation sets in advance, one for validating pseudo-labels (val_pseudo) and one for end-model detector performance (val_det). We report performance for varying ratios $x$ for training SeMoLi (train_pseudo) and generating pseudo-labels for training our detector (train_det).
  • Figure 4: Comparison of different graph construction approaches: We compare three different graph construction approaches as the initial hypothesis for SeMoLi: position-based, velocity-based, and a combination of both, where we first build a graph based on position and then cut edges if node velocities are highly different. Blue points represent nodes, i.e., points in the point cloud; edges represent the initial hypotheses of connections to be refined by our GNN. Position-based graph construction utilizes the inductive bias of proximity, whereas the velocity-based hypothesis yields edges spanning the entire scene, since points can have a similar velocity even if they are far apart in space.
  • Figure 5: Visualizations of Pedestrian Bounding Boxes: We show visualizations of pedestrian clusters in our filtered point cloud, our extracted bounding boxes (red), and ground truth bounding boxes (green). We can see that SeMoLi clusters points correctly, but the extracted bounding boxes are significantly smaller than their corresponding ground truth bounding boxes.
  • ...and 1 more figures
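
The combined graph construction described in the Figure 4 caption (build edges from position, then cut edges whose endpoints move very differently) can be sketched as follows. This is an illustrative sketch only: the use of k-nearest neighbors for the position-based step, the function name `build_graph`, and the parameters `k` and `max_velocity_diff` are assumptions for this example and are not taken from the paper.

```python
import numpy as np


def build_graph(positions: np.ndarray, velocities: np.ndarray,
                k: int = 2, max_velocity_diff: float = 1.0):
    """Build a position-based graph and a velocity-pruned version of it.

    positions:  (N, 3) point coordinates.
    velocities: (N, 3) per-point velocity vectors.
    k:          number of nearest neighbors per node (assumed hyperparameter).
    max_velocity_diff: edges whose endpoint velocities differ by more than
                this norm are cut (assumed hyperparameter).

    Returns (position_edges, combined_edges), each an (E, 2) integer array
    of directed (source, target) node indices.
    """
    n = positions.shape[0]
    if n < 2:
        empty = np.zeros((0, 2), dtype=np.int64)
        return empty, empty

    # Pairwise Euclidean distances; a node is never its own neighbor.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)

    # Position-based hypothesis: connect every node to its k nearest nodes.
    k = min(k, n - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # (N, k)
    src = np.repeat(np.arange(n), k)
    dst = neighbors.reshape(-1)
    position_edges = np.stack([src, dst], axis=1)

    # Combined hypothesis: cut edges whose endpoints move very differently.
    velocity_gap = np.linalg.norm(velocities[src] - velocities[dst], axis=1)
    combined_edges = position_edges[velocity_gap <= max_velocity_diff]
    return position_edges, combined_edges
```

In this sketch the pruned edge set is only an initial hypothesis; per the caption, deciding which remaining edges truly connect points of the same instance is left to the GNN.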