Learning to Find Missing Video Frames with Synthetic Data Augmentation: A General Framework and Application in Generating Thermal Images Using RGB Cameras

Mathias Viborg Andersen, Ross Greer, Andreas Møgelmose, Mohan Trivedi

TL;DR

This paper addresses the issue of missing data due to sensor frame rate mismatches, introducing a generative model approach to create synthetic yet realistic thermal imagery using conditional generative adversarial networks (cGANs), specifically comparing the pix2pix and CycleGAN architectures.

Abstract

Advanced Driver Assistance Systems (ADAS) in intelligent vehicles rely on accurate driver perception within the vehicle cabin, often leveraging a combination of sensing modalities. However, these modalities operate at varying rates, posing challenges for real-time, comprehensive driver state monitoring. This paper addresses the issue of missing data due to sensor frame rate mismatches, introducing a generative model approach to create synthetic yet realistic thermal imagery. We propose using conditional generative adversarial networks (cGANs), specifically comparing the pix2pix and CycleGAN architectures. Experimental results demonstrate that pix2pix outperforms CycleGAN, and utilizing multi-view input styles, especially stacked views, enhances the accuracy of thermal image generation. Moreover, the study evaluates the model's generalizability across different subjects, revealing the importance of individualized training for optimal performance. The findings suggest the potential of generative models in addressing missing frames, advancing driver state monitoring for intelligent vehicles, and underscoring the need for continued research in model generalization and customization.
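
To make the frame rate mismatch concrete, the following is a minimal sketch of how the "missing" thermal frames could be located before any generation takes place. It is not the authors' pipeline: the function name `find_missing_frames`, the tolerance `tol`, and the 30 Hz and 10 Hz rates in the usage note are hypothetical choices made for illustration only.

```python
import numpy as np

def find_missing_frames(fast_ts, slow_ts, tol):
    """Mark instants of the faster sensor that have no slower-sensor frame nearby.

    fast_ts, slow_ts: sorted 1-D sequences of timestamps in seconds.
    tol: maximum time difference (seconds) for two frames to count as simultaneous.
    Returns a boolean mask over fast_ts; True means a frame would need to be generated.
    """
    fast_ts = np.asarray(fast_ts, dtype=float)
    slow_ts = np.asarray(slow_ts, dtype=float)
    # Index of the first slow timestamp that is >= each fast timestamp.
    idx = np.searchsorted(slow_ts, fast_ts)
    # Clamp the neighbours on either side so they stay inside the array.
    left = np.clip(idx - 1, 0, len(slow_ts) - 1)
    right = np.clip(idx, 0, len(slow_ts) - 1)
    # Distance to the temporally nearest slow-sensor frame.
    nearest = np.minimum(np.abs(fast_ts - slow_ts[left]),
                         np.abs(fast_ts - slow_ts[right]))
    return nearest > tol
```

For example, with an assumed RGB stream at 30 Hz (`np.arange(30) / 30.0`) and an assumed thermal stream at 10 Hz (`np.arange(10) / 10.0`), a tolerance of 1 ms marks 20 of the 30 RGB instants in one second as lacking a thermal frame. Those are the gaps a generative model would be asked to fill.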

Paper Structure (14 sections, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Many perspectives and modalities of data may contribute to robust driver state monitoring. Differing frame rates of sensors lead to an unavailability of "complete" sets of data from all modalities for a given instance. Because many driver states are best inferred from temporal patterns, an ideal data stream would have constant availability of all sources at each instance. Without such a stream, models may be limited to instance inference (blue), complete-but-temporally-distant sequences (red), or incomplete-but-temporally-local sequences (yellow). By generating missing data, we can provide synthetic but useful representations to fill in these gray gaps, enabling accurate downstream state estimation models using pseudo-complete, temporally-local sequences.
  • Figure 2: When sensors operate at different rates, it is possible that the temporally-nearest measurement to a given instance may have taken place before a significant action for one sensor, and after the action for another. In the above example, the driver has abruptly moved his hands closer to the wheel; however, the thermal camera has not yet processed another signal to capture this motion. So, if both "most recent" signals are sent to a multimodal model meant to estimate a driver's takeover readiness (e.g. proximity of hands to the steering wheel), the model would have a large amount of uncertainty from modal disagreement.
  • Figure 3: The flow of pix2pix applied in this work.
  • Figure 4: Example images showing the perspectives captured by the cameras used in our simulator setup.
  • Figure 6: From top to bottom, we show the generator output at different iterations of training. Initially, the generator produces a random image; over time, it refines its output to match the intended thermal image. These images are separated by only 10 iterations each, except for the final image, which represents a jump to 20,000 iterations.
  • ...and 9 more figures
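
For readers unfamiliar with the pix2pix flow referenced in Figure 3, the objective below is the standard formulation from the original pix2pix work, not something stated in this summary. The mapping of symbols is an assumption: $x$ is taken to be the RGB input, $y$ the real thermal image, $z$ the noise input, and $\lambda$ the weight of the L1 term. Whether the authors modify the loss or its weighting is not specified here.

```latex
\begin{aligned}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big] \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
\end{aligned}
```

The adversarial term pushes generated thermal images to look realistic given the RGB input, while the L1 term keeps them close to the paired ground truth. That pairing is what distinguishes pix2pix from CycleGAN, which is designed for unpaired data.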