LLM-EvRep: Learning an LLM-Compatible Event Representation Using a Self-Supervised Framework
Zongyou Yu, Qiang Qu, Qian Zhang, Nan Zhang, Xiaoming Chen
TL;DR
This work addresses the challenge of applying large language models (LLMs) to event-based vision by learning an LLM-compatible event representation (LLM-EvRep) through a self-supervised framework. It introduces LLM-EvGen, an encoder-decoder network built from EfficientNetV2 MBConv blocks that maps discretized event streams into LLM-friendly representations. Training is guided by a dual alignment objective: semantic consistency with RGB frames and structural fidelity via Sobel-edge cues. Semantic alignment is enforced by an LLM-driven loss based on token-level similarity, while the structural fidelity loss preserves spatial detail; together they enable LLMs to perform zero-shot event recognition more effectively. Empirical results on N-ImageNet, N-Caltech101, and N-MNIST show that LLM-EvRep outperforms event-to-video and CLIP-based baselines in evaluations with both open and closed models, pointing to a practical path for integrating event data with powerful language models in real-time, scalable recognition tasks.
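To make the Sobel-edge cue concrete, the following is a minimal sketch of how a structural fidelity term could be computed between a generated representation and a reference frame. The summary above does not give the paper's exact formulation, so the function names (`filter3x3`, `sobel_edges`, `structural_loss`), the use of gradient magnitude, and the mean-absolute-difference reduction are all illustrative assumptions rather than the authors' loss.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel (valid region only)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * image[dy:dy + h - 2, dx:dx + w - 2]
    return out


def sobel_edges(image):
    """Gradient-magnitude edge map of a 2D grayscale image."""
    image = np.asarray(image, dtype=np.float64)
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def structural_loss(representation, reference):
    """Hypothetical structural term: mean absolute difference of edge maps.

    Assumption: both inputs are 2D grayscale arrays of the same shape.
    """
    diff = sobel_edges(representation) - sobel_edges(reference)
    return float(np.mean(np.abs(diff)))
```

Under these assumptions the term is zero when the two inputs share the same edge structure and grows as edges in the reference go missing from the representation, which is the behavior a structure-preserving objective would need.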
Abstract
Recent advancements in event-based recognition have demonstrated significant promise, yet most existing approaches rely on extensive training, limiting their adaptability for efficient processing of event-driven visual content. Meanwhile, large language models (LLMs) have exhibited remarkable zero-shot capabilities across diverse domains, but their application to event-based visual recognition remains largely unexplored. To bridge this gap, we propose LLM-EvGen, an event representation generator that produces LLM-compatible event representations (LLM-EvRep), thereby enhancing the performance of LLMs on event recognition tasks. The generator is trained using a self-supervised framework that aligns the generated representations for semantic consistency and structural fidelity. Comprehensive experiments were conducted on three datasets: N-ImageNet, N-Caltech101, and N-MNIST. The results demonstrate that our method, LLM-EvRep, outperforms the event-to-video method E2VID by 15.93%, 0.82%, and 50.21%, respectively, in recognition tasks when evaluated using GPT-4o.
