ControlEvents: Controllable Synthesis of Event Camera Data with Foundational Prior from Image Diffusion Models

Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll

TL;DR

ControlEvents tackles the data scarcity in event-camera research by introducing a diffusion-based generator that conditions on text, 2D skeletons, and 3D SMPL poses. It leverages diffusion priors from foundation models to synthesize high-quality, labeled event data with minimal fine-tuning, enabling zero-shot and few-shot generation. Across object recognition, 2D skeleton estimation, and 3D body recovery, synthetic events improve downstream performance and closely match real-event distributions, while enabling controllable data generation. The approach reduces labeling costs and opens up new avenues for text-to-motion and pose-conditioned event synthesis, with resources and datasets made publicly available.

Abstract

In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

Paper Structure

This paper contains 47 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 2: Comparison with EventGAN zhu2021eventgan and SMPL-ESIM xue2024elnr on 3D human-based event generation tasks. Our ControlEvents can effectively synthesize both events and noise, minimizing the domain gap and closely matching the characteristics of real event data.
  • Figure 3: Overview of ControlEvents. For text-conditioned event data synthesis, we fine-tune Stable Diffusion rombach2021stablediffusion. For 2D and 3D pose-conditioned event data synthesis, we fine-tune ControlNet zhang2023controlnet using a skeleton map or a normal map. Our ControlEvents can synthesize large-scale datasets for various tasks.
  • Figure 4: Zero-shot generation from unseen text labels. We also show the closest text label seen during training, based on CLIP cosine similarity.
  • Figure 5: SMPL-based generation on challenging unseen poses. Given challenging AMASS mahmood2019amass poses, we can generate realistic event data.
  • Figure 6: Ablation of the foundational prior from Stable Diffusion rombach2021stablediffusion. Without the prior, our method cannot generate meaningful events for unseen text labels.
  • ...and 8 more figures
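
The Figure 4 caption pairs each unseen text label with the closest label seen during training, ranked by CLIP cosine similarity. A minimal sketch of that lookup is given below. It assumes the text embeddings have already been computed; the three-dimensional vectors, the label names, and the helper `closest_seen_label` are hypothetical stand-ins for illustration and are not taken from the paper or from any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_seen_label(query_embedding, seen_embeddings):
    """Return the seen label whose embedding is most similar to the query.

    `seen_embeddings` maps label -> embedding vector (hypothetical data layout).
    """
    return max(
        seen_embeddings,
        key=lambda label: cosine_similarity(query_embedding, seen_embeddings[label]),
    )


# Toy embeddings standing in for CLIP text features (assumed values).
seen = {
    "car": [0.9, 0.1, 0.0],
    "dog": [0.0, 0.8, 0.6],
    "chair": [0.1, 0.0, 0.95],
}
unseen_query = [0.8, 0.3, 0.1]  # hypothetical embedding of an unseen label

nearest = closest_seen_label(unseen_query, seen)
print(nearest)
```

With real CLIP features the vectors would be much higher-dimensional, but the ranking step is the same: normalize by both vector norms and take the label with the largest score.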