Learning to See through Illumination Extremes with Event Streaming in Multimodal Large Language Models

Baoheng Zhang, Jiahui Liu, Gui Zhao, Weizhou Zhang, Yixuan Ma, Jun Jiang, Yingxian Chen, Wilton W. T. Fok, Xiaojuan Qi, Hayden Kwok-Hay So

Abstract

Multimodal Large Language Models (MLLMs) exhibit strong vision-language reasoning under standard conditions but fail in extreme illumination, where RGB inputs irrecoverably lose structure and semantics. We propose Event-MLLM, an event-enhanced model that performs all-light visual reasoning by dynamically fusing event streams with RGB frames. Two key components drive our approach: an Illumination Indicator, a learnable signal derived from a DINOv2 branch that represents exposure degradation and adaptively modulates event-RGB fusion, and an Illumination Correction Loss that aligns fused features with non-degraded (normal-light) semantics in the latent space, compensating for information lost in extreme lighting. We curate the first multi-illumination event-instruction corpus for MLLMs, with 2,241 event-RGB samples (around 6 QA pairs each) across diverse scenes and 17 brightness ratios (0.05x to 20x), plus an instruction-following benchmark for reasoning, counting, and fine-grained recognition under extreme lighting. Experiments show that Event-MLLM markedly outperforms general-purpose, illumination-adaptive, and event-only baselines, setting a new state of the art in robust multimodal perception and reasoning under challenging illumination.
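
The abstract's description of the Illumination Indicator suggests a simple gating pattern: a DINOv2-derived signal estimates how degraded the exposure is and weights how much the event stream contributes relative to the RGB stream. The sketch below is our own illustration of that idea in PyTorch, not the authors' released code; the module names, feature dimensions, scalar-gate form, and pooling choice are all assumptions.

```python
# Minimal sketch (not the authors' implementation): illumination-gated event-RGB fusion.
# Module names, dimensions, and the scalar-gate form are assumptions for illustration.
import torch
import torch.nn as nn

class IlluminationGatedFusion(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Hypothetical MLP mapping pooled DINOv2 features to an illumination indicator in (0, 1).
        self.indicator_mlp = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1), nn.Sigmoid()
        )
        # Hypothetical projection MLPs for the RGB and event token streams.
        self.rgb_proj = nn.Linear(dim, dim)
        self.event_proj = nn.Linear(dim, dim)

    def forward(self, rgb_tokens, event_tokens, dino_tokens):
        # rgb_tokens, event_tokens, dino_tokens: (batch, num_tokens, dim)
        # Pool the DINOv2 branch and predict the degree of exposure degradation.
        f_illu = self.indicator_mlp(dino_tokens.mean(dim=1))   # (batch, 1)
        w = f_illu.unsqueeze(1)                                 # (batch, 1, 1)
        # The more degraded the RGB frame, the more the event stream contributes.
        fused = (1.0 - w) * self.rgb_proj(rgb_tokens) + w * self.event_proj(event_tokens)
        return fused, f_illu
```

In this reading, a high indicator value (badly over- or underexposed RGB) shifts the fused representation toward the event features, while a well-exposed frame keeps the RGB pathway dominant.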

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Extreme lighting scenarios are demonstrated. When the normal (a) RGB frame cannot be obtained, only the (c) overexposed or (d) underexposed frame can be used as input. Existing baselines often produce hallucinations due to information degradation, whereas our method incorporates information from the (b) event frame, enabling the model to provide more accurate responses.
  • Figure 2: An overview of our proposed model is illustrated: (a) Features are extracted from extreme-light RGB frames by the original vision encoder and the DINOv2 vision encoder, respectively. These features are then fused with the features extracted from the event frame using our designed MLPs and our proposed Illumination Indicator ($F_{\text{illu}}$) to obtain fused features $F_{\text{fusion}}$ (Section \ref{sec:met-fea}). (b) Features are extracted from normal-light RGB frames and used for representation learning with the fused features via our proposed Illumination Correction Loss $\mathcal{L}_{\mathrm{IC}}$ (Section \ref{sec:met-ill}); a minimal sketch of this objective is given after this figure list. (c) During training, the fused features participate in fine-tuning the entire MLLM, while only (a) and (c) are used during inference (Section \ref{sec:met-train}).
  • Figure 3: Qualitative comparison under various illumination levels for the multiple-choice task (top) and the object-counting task (bottom). Under severe underexposure and overexposure, the baseline model frequently produces incorrect or hallucinated predictions, whereas our illumination-guided event-enhanced model remains stable and accurate across all brightness ratios. These examples illustrate the robustness of our method in extreme-light conditions.
  • Figure 4: Visualization results of features before and after fusion with t-SNE. (a) Before fusion: event features and RGB features under different brightness ratios. (b) After fusion: fused features under different brightness ratios cluster around the normal-light feature.
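
Figure 2(b) and the abstract describe the Illumination Correction Loss as aligning fused features with normal-light semantics in the latent space. The exact functional form is not given here, so the following is a hedged sketch using a cosine-alignment term with a detached normal-light target; the function name, the choice of cosine distance, and detaching the target are assumptions.

```python
# Minimal sketch (an assumption, not the released implementation) of an
# "Illumination Correction" objective: pull fused features toward the features
# of the corresponding normal-light frame, here via a cosine-alignment term.
import torch
import torch.nn.functional as F

def illumination_correction_loss(fused_feats, normal_feats):
    """fused_feats, normal_feats: (batch, num_tokens, dim).
    The normal-light features are detached so they act as a fixed semantic target."""
    target = normal_feats.detach()
    cos = F.cosine_similarity(fused_feats, target, dim=-1)  # (batch, num_tokens)
    return (1.0 - cos).mean()
```

An L2 or contrastive alignment would serve the same purpose; the behavior shown in Figure 4 is that, after training with such a term, fused features from all brightness ratios cluster around the normal-light feature.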