evMLP: An Efficient Event-Driven MLP Architecture for Vision

Zhentan Zheng

TL;DR

evMLP introduces an all-MLP vision model augmented with an event-driven local update mechanism that treats inter-frame changes as events and recomputes only the patches where changes occur. By processing image patches independently and reusing computations for unchanged regions, it achieves competitive ImageNet accuracy ($73.5\%$ top-1) at $1.03$ GMACs with high inference throughput. In video scenarios, the event-driven mechanism reduces MACs by $7$–$14\%$ on average, with larger gains on stationary-camera data (over $25\%$ in some cases) and a tunable trade-off between efficiency and accuracy via the event threshold. These results suggest a practical, patch-level, MLP-based approach for real-time vision tasks, particularly in surveillance settings where the background is often static.

Abstract

Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as "events". Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and pre-trained models are available at https://github.com/i-evi/evMLP.
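
The idea of treating inter-frame changes as per-patch events can be illustrated with a minimal sketch. The exact event rule is not given in this summary, so the code below assumes one plausible rule: a patch fires an event when the mean absolute pixel difference inside it exceeds a threshold `tau`. The names `event_map` and `recompute_fraction`, the grayscale nested-list frame format, and the rule itself are illustrative assumptions, not the authors' implementation.

```python
def event_map(prev, curr, patch, tau):
    """Return a 2D list of booleans, one per patch; True marks an event.

    ASSUMPTION: an event fires when the mean absolute pixel difference
    inside a patch is strictly greater than tau. The paper's actual rule
    may differ. Frames are nested lists of floats (grayscale).
    """
    h, w = len(curr), len(curr[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by patch size")
    events = []
    for py in range(0, h, patch):
        row = []
        for px in range(0, w, patch):
            # Accumulate |curr - prev| over the pixels of this patch.
            total = 0.0
            for y in range(py, py + patch):
                for x in range(px, px + patch):
                    total += abs(curr[y][x] - prev[y][x])
            row.append(total / (patch * patch) > tau)
        events.append(row)
    return events


def recompute_fraction(events):
    """Fraction of patches that would need recomputation."""
    flat = [e for row in events for e in row]
    return sum(flat) / len(flat)
```

For a $224 \times 224$ frame and a patch size of 7, this produces a $32 \times 32$ grid of 1024 patches. Under the assumed rule, raising `tau` can only turn events off, which matches the efficiency/accuracy trade-off described in the TL;DR: fewer patches are recomputed, at the cost of ignoring small changes.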

Paper Structure

This paper contains 12 sections, 5 figures, 4 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The event-driven local update mechanism. (a) and (b) are two consecutive frames with a resolution of $224 \times 224$. (c) presents the corresponding event map for a patch size of 7, where white regions denote activated events (the event calculation is detailed in the method section on the event-driven local update). (d) is the current frame (b) masked by the event map (c), highlighting regions requiring recomputation.
  • Figure 2: (a) shows an overview of the evMLP architecture. The image or feature map is divided into fixed-size patches, and each patch is then processed independently by the proposed MLP-based building blocks. (b) shows the structure of the proposed MLP-based building block, which consists of a fully connected layer followed by $n$ Inverted Residual Bottlenecks. In each Inverted Residual Bottleneck, an activation function is applied after the first fully connected layer; the output of the second fully connected layer is added to the bottleneck input through a residual connection, and the sum is passed through Layer Normalization to produce the final output.
  • Figure 3: Comparison of GPU memory consumption across different models under varying batch sizes. Inference was performed using NVIDIA TensorRT.
  • Figure 4: Comparison of event maps generated with different event thresholds. (a) and (b) are consecutive video frames. (c)-(e) show event maps generated with different event thresholds $\tau$, superimposed on frame (b); brighter areas indicate patches where events occur.
  • Figure 5: Impact of event thresholds on accuracy and computational cost across datasets, $\tau \in \{0, 0.05, 0.1, 0.15\}$.
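
The building block in the Figure 2 caption, and the way per-patch independence enables cached local updates, can be sketched in a few lines of dependency-free Python. This is a simplified illustration under stated assumptions, not the released implementation: ReLU as the activation, an expansion ratio of 2, Layer Normalization without learnable scale and shift, random weights, and all function names (`block`, `bottleneck`, `local_update`, `make_block`) are hypothetical choices made for this example.

```python
import math
import random


def linear(x, weight, bias):
    """Fully connected layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Layer Normalization without learnable affine terms (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def bottleneck(x, params):
    """FC -> activation -> FC -> residual add -> LayerNorm, as in Figure 2(b)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]  # ReLU is an assumption
    out = linear(hidden, w2, b2)
    return layer_norm([o + v for o, v in zip(out, x)])


def block(patch, fc, bottlenecks):
    """One fully connected layer followed by n bottlenecks, for one patch."""
    x = linear(patch, *fc)
    for params in bottlenecks:
        x = bottleneck(x, params)
    return x


def init_linear(rng, n_in, n_out):
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def make_block(rng, n_in, width, n, expand=2):
    """Random parameters; the expansion ratio `expand` is an assumption."""
    fc = init_linear(rng, n_in, width)
    bottlenecks = []
    for _ in range(n):
        w1, b1 = init_linear(rng, width, width * expand)
        w2, b2 = init_linear(rng, width * expand, width)
        bottlenecks.append((w1, b1, w2, b2))
    return fc, bottlenecks


def local_update(patches, events, cache, fc, bottlenecks):
    """Recompute only patches flagged by an event; reuse the cache elsewhere."""
    return [
        block(p, fc, bottlenecks) if e else c
        for p, e, c in zip(patches, events, cache)
    ]
```

Because each patch passes through the block without reading any other patch, a cached output stays valid for as long as its input patch is unchanged. In this sketch, if `events` flags exactly the patches whose inputs differ from the previous frame, `local_update` returns the same values as running `block` on every patch. With a nonzero event threshold, small changes are ignored and the cached outputs become approximations.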