E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Jiajun Zhai, Hao Shi, Shangwei Guo, Kailun Yang, Kaiwei Wang

Abstract

Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illumination settings. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing and fusion for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, can substantially improve robustness in dark and blur-heavy scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms exposure), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at https://github.com/JJayzee/E-VLA.
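To make the parameter-free overlay fusion mentioned above concrete, below is a minimal NumPy sketch of the idea: events in a time window are accumulated into a per-pixel polarity map and alpha-blended on top of the RGB frame before encoding. The function names (`accumulate_events`, `overlay_fusion`), the red/blue polarity coloring, and the blend weight `alpha` are illustrative assumptions, not the exact scheme used in E-VLA.

```python
import numpy as np

def accumulate_events(events, height, width):
    """Accumulate events into per-pixel ON/OFF counts.

    `events` is assumed to be an iterable of (x, y, t, polarity) tuples with
    integer pixel coordinates, as produced by a DAVIS-style sensor driver.
    """
    pos = np.zeros((height, width), dtype=np.float32)
    neg = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:
        if p > 0:
            pos[y, x] += 1.0
        else:
            neg[y, x] += 1.0
    return pos, neg

def overlay_fusion(rgb, events, alpha=0.6):
    """Parameter-free fusion: draw the accumulated event map over the RGB frame.

    ON events are rendered in red and OFF events in blue (an illustrative
    choice); pixels without events keep their original RGB values.
    """
    h, w, _ = rgb.shape
    pos, neg = accumulate_events(events, h, w)
    event_map = np.zeros_like(rgb, dtype=np.float32)
    event_map[..., 0] = np.clip(pos, 0, 1) * 255.0   # red channel for ON events
    event_map[..., 2] = np.clip(neg, 0, 1) * 255.0   # blue channel for OFF events
    mask = (pos + neg) > 0
    fused = rgb.astype(np.float32)
    fused[mask] = (1 - alpha) * fused[mask] + alpha * event_map[mask]
    return fused.astype(np.uint8)
```

Because this fusion introduces no trainable parameters, the fused frame can be fed to the unchanged, pretrained vision encoder, which is what makes it attractive as a drop-in robustness baseline.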

Paper Structure

This paper contains 34 sections, 6 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Middle: Image-based VLA models degrade under low-light and motion-blurred conditions, leading to failures in object detection and imprecise manipulation. We propose E-VLA, which fuses stable event visual cues with image features within the VLA pipeline, preserving reliable performance under adverse conditions. Left: Existing VLA models (e.g., SmolVLA shukor2025smolvla) are pretrained on large-scale text-image datasets and manipulation videos. We build a teleoperation system equipped with an event camera and collect synchronized RGB-event-action data for E-VLA training. Right: We evaluate our method across gradient low-light and motion-blurred scenes. E-VLA consistently achieves higher success rates than RGB baselines.
  • Figure 2: An overview of our proposed E-VLA framework. Our architecture integrates event-based visual sensing with RGB frames and proprioceptive robot states to generate control sequences. We investigate two fusion strategies: (1) a Hierarchical Event Adapter that injects event features into intermediate layers of a frozen ViT encoder through trainable fusion modules, and (2) an Overlay strategy that directly combines events with RGB images prior to encoding via SigLIP. The resulting fusion visual tokens are concatenated with language tokens and state tokens and then processed by a frozen LLM backbone, which conditions an Action Expert to produce normalized robot actions. Snowflakes and flames denote frozen and trainable parameters, respectively.
  • Figure 3: Middle: The visualization of the proposed dataset. Events are represented as colored frames following Sec. \ref{subsec:win_and_repr}. Left: Side and top views of our teleoperation platform based on the LeRobot SO100 manipulator cadene2024lerobot and a DAVIS346 event camera. Right: Above are the statistics of our dataset. The line chart below shows that even when the image signal rapidly decays with decreasing illumination, the event modality can still maintain a stable event rate of $\sim 87$ kilo-events per second (KEPS).
  • Figure 4: Qualitative comparison of visual inputs under different illumination.
  • Figure S5: Grayscale distribution of images captured under different illumination levels. (a) Cumulative distribution functions (CDFs) of pixel grayscale values for images captured at $20$, $30$, $40$, $75$, $100$, and $200$ lux. The logarithmic x-axis highlights the differences in low-intensity regions. (b) Corresponding RGB images with their average grayscale values indicated.
  • ...and 3 more figures
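The Hierarchical Event Adapter described in the Figure 2 caption injects event features into intermediate layers of a frozen ViT encoder through trainable fusion modules. The following PyTorch sketch illustrates one plausible realization under stated assumptions: image tokens cross-attend to projected event tokens, a zero-initialized gate keeps the frozen encoder's behavior at initialization, and adapters are inserted after hypothetical blocks 3, 7, and 11. The class names, injection depths, and fusion mechanism are assumptions for illustration; the paper's actual adapter design and event tokenizer are not specified in this excerpt.

```python
import torch
import torch.nn as nn

class EventFusionAdapter(nn.Module):
    """Trainable fusion module: image tokens attend to event tokens, residual add."""
    def __init__(self, dim, event_dim, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(event_dim, dim)          # map event features to ViT width
        self.norm_img = nn.LayerNorm(dim)
        self.norm_evt = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # zero-init: adapter starts as identity

    def forward(self, img_tokens, evt_tokens):
        evt = self.norm_evt(self.proj(evt_tokens))
        fused, _ = self.cross_attn(self.norm_img(img_tokens), evt, evt)
        return img_tokens + torch.tanh(self.gate) * fused

class HierarchicalEventViT(nn.Module):
    """Frozen ViT encoder with trainable event adapters after selected blocks."""
    def __init__(self, vit_blocks, dim, event_dim, inject_at=(3, 7, 11)):
        super().__init__()
        # `vit_blocks` is assumed to be an nn.ModuleList of pretrained ViT blocks
        # (e.g., from a SigLIP-style encoder) mapping tokens -> tokens.
        self.blocks = vit_blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)                    # keep the pretrained encoder frozen
        self.inject_at = set(inject_at)
        self.adapters = nn.ModuleDict({
            str(i): EventFusionAdapter(dim, event_dim) for i in inject_at
        })

    def forward(self, img_tokens, evt_tokens):
        for i, block in enumerate(self.blocks):
            img_tokens = block(img_tokens)
            if i in self.inject_at:
                img_tokens = self.adapters[str(i)](img_tokens, evt_tokens)
        return img_tokens
```

Only the adapters (and gate) receive gradients, so the integration stays compatible with the pretrained backbone while letting event cues modulate intermediate visual features.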