Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input

Joachim Ott, Zuowen Wang, Shih-Chii Liu

TL;DR

This paper tackles the scarcity of labeled event-camera data by introducing a text-to-events pipeline that generates motion-rich event streams directly from text prompts. It combines a lightweight, iteratively trained autoencoder to produce sparse event frames with a diffusion model conditioned on text embeddings from a large language–video model, enabling end-to-end generation of realistic event sequences that are decoded into ON/OFF streams. The approach is evaluated on DVS gesture data, with pretraining on DAVIS 240C and diffusion training on the DVS128 gesture dataset, achieving 93.8% accuracy on real data and up to 62.8% on generated samples depending on the prompt and sampling, demonstrating the potential to synthesize useful event datasets. The work offers a practical path to scalable, text-driven event data generation, reducing reliance on slow real-data collection and paving the way for future joint intensity-and-event generation from text.

Abstract

Event cameras are advantageous for tasks that require vision sensors with low latency and sparse output responses. However, the development of deep network algorithms using event cameras has been slow because of the lack of large labelled event camera datasets for network training. This paper reports a method for creating new labelled event datasets by using a text-to-X model, where X is one or multiple output modalities; in this work, X is events. Our proposed text-to-events model produces synthetic event frames directly from text prompts. It uses an autoencoder which is trained to produce sparse event frames representing event camera outputs. By combining the pretrained autoencoder with a diffusion model architecture, the new text-to-events model is able to generate smooth synthetic event streams of moving objects. The autoencoder was first trained on an event camera dataset of diverse scenes. In the combined training with the diffusion model, the DVS gesture dataset was used. We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements. The classification accuracy of the generated sequences, using a classifier trained on the real dataset, ranges from 42% to 92%, depending on the gesture group. The results demonstrate the capability of this method in synthesizing event datasets.

Paper Structure (20 sections, 12 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of the entire pipeline. From top to bottom: A text prompt is encoded via a pretrained large-scale contrastive language-video model. This embedding is used as conditional input to a diffusion model. The output of this model is fed to a specially trained sparse decoder that outputs event frames representing time-binned event counts. Via Bernoulli sampling from these event frames, ON and OFF event streams of a smooth gesture motion are constructed.
  • Figure 2: Data Preprocessing: Event streams from the datasets are split into smaller slices of a fixed time window length. Within these slices, a patch grid in the image dimension is overlaid. If the count of all ON and OFF events within a given patch is below a threshold, these events are removed from the event stream. The filtered event stream is then split into slices of a fixed event count, and each slice is converted to a time-binned event frame in the format of a spatio-temporal voxel grid.
  • Figure 3: Iterative autoencoder (AE) training: Training starts at stage 1 with a low resolution of $8\times8$ pixels, and the AE core is trained together with an adapter input layer and an adapter output layer. These adapter layers map the input resolution to the input size of the next layer and the last layer's output to the output resolution (same as the input resolution). In the next stage, the core layer parameters are frozen, the adapter layers are discarded, and new encoder and decoder layers are added, together with a new pair of adapter layers. The input event frames are $2\times$ the resolution of the previous training stage. This iterative process continues until the final $128\times128$ resolution is reached. An additional fine-tuning step with training on all parameters may be done as well.
  • Figure 4: Warm-up for full event occupancy during autoencoder training: To reduce the probability that the decoder collapses to outputting zero events everywhere, we set a decreasing number of values in a training batch to the maximum value in that batch. In the first epoch, here depicted for a $128 \times 128$ image resolution, all values are set to the maximum. In epoch 2, an outer rim of each event frame (in the form of a voxel grid) keeps its original values. The number of values set to the batch maximum value is further decreased in the following epochs. Starting from epoch 7, no voxels are set to the maximum value.
  • Figure 5: Sum of ON events over the first 100 spatio-temporal voxel grids. Left column: ground-truth dataset samples, each with its class label. Right column: generated event streams, each with the prompt used for generation. Our model correctly emphasizes event generation in the relevant motion trajectory areas.
  • ...and 2 more figures
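
The preprocessing in the Figure 2 caption and the Bernoulli sampling in the Figure 1 caption can be illustrated with a minimal sketch on a toy slice. This is not the authors' implementation. The event tuple layout `(t, x, y, p)`, the function names, the patch size, the threshold, the uniform time binning, and the reading of a clipped voxel value as an event probability are all assumptions made for illustration; the captions do not specify them.

```python
import random
from collections import Counter

# Assumed event layout: (t, x, y, p) = timestamp, column, row, polarity (1 = ON, 0 = OFF).

def filter_sparse_patches(events, patch=4, min_count=2):
    """Drop events that fall in grid patches holding fewer than min_count events."""
    counts = Counter((x // patch, y // patch) for _, x, y, _ in events)
    return [e for e in events if counts[(e[1] // patch, e[2] // patch)] >= min_count]

def to_voxel_grid(events, bins, height, width):
    """Bin one slice into event counts indexed as grid[p][bin][y][x]."""
    grid = [[[[0] * width for _ in range(height)] for _ in range(bins)] for _ in range(2)]
    if not events:
        return grid
    t0 = min(e[0] for e in events)
    span = max(e[0] for e in events) - t0
    for t, x, y, p in events:
        # Uniform time bins over the slice; the latest event lands in the last bin.
        b = min(bins - 1, int((t - t0) / span * bins)) if span > 0 else 0
        grid[p][b][y][x] += 1
    return grid

def sample_events(grid, rng):
    """Bernoulli-sample events; each voxel value, clipped to [0, 1], is used as a probability.

    The timestamp of a sampled event is its time-bin index (hypothetical convention).
    """
    out = []
    for p, per_polarity in enumerate(grid):
        for b, frame in enumerate(per_polarity):
            for y, row in enumerate(frame):
                for x, value in enumerate(row):
                    if rng.random() < min(1.0, max(0.0, value)):
                        out.append((b, x, y, p))
    return out

# Toy slice on an 8x8 sensor: the third event is alone in its patch and is filtered out.
events = [(0, 0, 0, 1), (5, 1, 1, 0), (9, 7, 7, 1)]
kept = filter_sparse_patches(events, patch=4, min_count=2)
grid = to_voxel_grid(kept, bins=2, height=8, width=8)
stream = sample_events(grid, random.Random(0))
```

In this toy case every occupied voxel holds exactly one event, so the sampling step reproduces each kept event with its timestamp replaced by a bin index. With fractional decoder outputs, repeated sampling would yield different sparse streams from the same frames.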