Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

TL;DR

PedGen is a generative model whose novel context encoder lifts the 2D scene context to 3D and incorporates various context factors to generate realistic pedestrian movements in urban scenes. It achieves zero-shot generalization in both real-world and simulated environments.

Abstract

Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which existing human generation methods often ignore. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data will be made publicly available at https://genforce.github.io/PedGen/.
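
The idea of training with partial labels through a mask embedding can be illustrated with a small sketch. It is not taken from the PedGen code: the function names `apply_mask_embedding` and `masked_mse`, the per-frame validity flags, and the choice of a mean-squared-error loss are all assumptions made for illustration. Frames whose labels were discarded are replaced by a shared mask vector (learnable in a real model, fixed here), and the loss is computed only over frames that still have labels.

```python
from typing import List, Sequence

Frame = List[float]


def apply_mask_embedding(
    motion: Sequence[Sequence[float]],
    valid: Sequence[bool],
    mask_embedding: Sequence[float],
) -> List[Frame]:
    """Replace frames without a usable label by the shared mask vector.

    Hypothetical helper: in a trained model the mask vector would be a
    learnable parameter; here it is a fixed list for illustration.
    """
    return [
        list(frame) if ok else list(mask_embedding)
        for frame, ok in zip(motion, valid)
    ]


def masked_mse(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    valid: Sequence[bool],
) -> float:
    """Mean squared error over labeled frames only (assumed loss form)."""
    total, count = 0.0, 0
    for p, t, ok in zip(pred, target, valid):
        if not ok:
            continue  # unlabeled frames contribute nothing to the loss
        total += sum((a - b) ** 2 for a, b in zip(p, t))
        count += len(p)
    return total / count if count else 0.0


if __name__ == "__main__":
    motion = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
    valid = [True, False, True]
    print(apply_mask_embedding(motion, valid, [9.0, 9.0]))
    print(masked_mse([[1.0, 2.0], [5.0, 5.0], [3.0, 6.0]], motion, valid))
```

Under these assumptions, a sequence with some bad frames can still contribute its remaining frames to training instead of being dropped entirely.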

Paper Structure

This paper contains 29 sections, 1 equation, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Pedestrian Movement Generation. Our method can generate diverse pedestrian movements in real-world (top row) and simulated (bottom row) urban environments.
  • Figure 2: Samples in the CityWalkers dataset. Top Left: The diverse pedestrian movements. Top Right: The diverse body shapes of the pedestrians. Bottom Left: The diverse surrounding environments from depth-unprojected images and the 4D pedestrian movement labels. Bottom Right: The diverse route destinations shown on the depth labels and semantic maps of the scene. The background showcases pedestrians in bustling cities from which we construct the CityWalkers dataset.
  • Figure 3: Our method. We discard the anomalous labels with an iterative automatic label filtering procedure and add the partial labels to the training data. We then train PedGen with a Context Encoder to represent crucial context factors. The scene context is obtained by lifting the 2D depth and semantic labels to the 3D space and converting them into a local voxel representation. The encoded scene context is combined with other context factors, including the body shape and the goal, to get the context embedding $\boldsymbol{c}$. The context embedding $\boldsymbol{c}$ and the timestep embedding $\boldsymbol{k}$ are then used to guide the Denoising Transformer to predict the clean motion from the noised one. We use a learnable motion mask embedding $\boldsymbol{m}$ to address the partial labels during training.
  • Figure 4: Visualizations of the generated pedestrian movements. The top row shows results in real scenes from the CityWalkers dataset, the middle row shows results in the real-world Waymo test set, and the bottom row shows results in simulated scenes from the CARLA test set.
  • Figure 5: Qualitative comparison of training with context factors. We compare the generated movements of PedGen trained with or without context factors in real-world environments (a) and in simulation (b).
  • ...and 10 more figures
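
The Figure 3 caption describes lifting 2D depth and semantic labels to 3D and converting them into a local voxel representation. The sketch below shows one way such a step could look; it is an assumption-laden illustration, not the authors' implementation. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)`, metric depth, a cubic grid centered on a chosen 3D point, and a majority vote over semantic labels per voxel. The names `unproject_pixel` and `voxelize_local_scene` and the parameters `extent` and `voxel_size` are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Voxel = Tuple[int, int, int]


def unproject_pixel(
    u: int, v: int, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with metric depth into camera coordinates
    using the standard pinhole model."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def voxelize_local_scene(
    depth_map: Sequence[Sequence[float]],
    semantic_map: Sequence[Sequence[int]],
    intrinsics: Tuple[float, float, float, float],
    center: Tuple[float, float, float],
    extent: float = 2.0,      # half side length of the local cube (assumed)
    voxel_size: float = 0.5,  # edge length of one voxel (assumed)
) -> Dict[Voxel, int]:
    """Return a sparse grid mapping voxel index -> majority semantic label."""
    fx, fy, cx, cy = intrinsics
    votes: Dict[Voxel, Dict[int, int]] = {}
    for v, (depth_row, label_row) in enumerate(zip(depth_map, semantic_map)):
        for u, (d, label) in enumerate(zip(depth_row, label_row)):
            if d <= 0:
                continue  # skip pixels without a valid depth
            point = unproject_pixel(u, v, d, fx, fy, cx, cy)
            offset = [p - c for p, c in zip(point, center)]
            if not all(-extent <= o < extent for o in offset):
                continue  # point lies outside the local cube
            index = tuple(int((o + extent) // voxel_size) for o in offset)
            counts = votes.setdefault(index, {})
            counts[label] = counts.get(label, 0) + 1
    # Keep the most frequent label in each occupied voxel.
    return {idx: max(counts, key=counts.get) for idx, counts in votes.items()}


if __name__ == "__main__":
    depth = [[1.0, 1.0], [0.0, 10.0]]
    labels = [[3, 7], [3, 7]]
    grid = voxelize_local_scene(depth, labels, (1.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
    print(grid)
```

A sparse dictionary is used here only to keep the example short; a dense array indexed by voxel coordinates would be the more usual input to a learned encoder.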