EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel

TL;DR

EgoNight tackles a critical gap in egocentric vision by introducing the first nighttime benchmark with day–night aligned videos for egocentric VQA, plus auxiliary tasks. It combines synthetic and real data (EgoNight-Synthetic, EgoNight-Sofia, EgoNight-Oxford) and a three-stage day-augmented labeling pipeline to produce 3,658 high-quality QA pairs across 12 QA types. Experiments across multiple MLLMs reveal pronounced day–night performance gaps, with notable difficulty in both VQA and auxiliary tasks like day–night retrieval and nighttime depth estimation. The dataset and benchmarks aim to drive development of illumination-robust egocentric perception and reasoning, with data and code to be released for community use.

Abstract

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.
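The day-to-night performance drop described above can be illustrated with a minimal sketch. It assumes each model prediction has already been judged correct or incorrect and stored as a record. The record layout (`qa_type`, `condition`, `correct`), the function name `day_night_gap`, and the toy data are hypothetical, chosen only for illustration; they are not the paper's evaluation code or results.

```python
from collections import defaultdict


def day_night_gap(records):
    """Per-QA-type accuracy for day and night, plus their difference.

    `records` is an iterable of dicts with hypothetical keys:
    'qa_type' (str), 'condition' ('day' or 'night'), 'correct' (bool).
    """
    # counts[qa_type][condition] = [number correct, number total]
    counts = defaultdict(lambda: {"day": [0, 0], "night": [0, 0]})
    for r in records:
        tally = counts[r["qa_type"]][r["condition"]]
        tally[0] += int(r["correct"])
        tally[1] += 1

    gaps = {}
    for qa_type, by_cond in counts.items():
        day_hits, day_total = by_cond["day"]
        night_hits, night_total = by_cond["night"]
        if day_total == 0 or night_total == 0:
            continue  # unpaired types (night only) have no gap to report
        day_acc = day_hits / day_total
        night_acc = night_hits / night_total
        gaps[qa_type] = {"day": day_acc, "night": night_acc, "gap": day_acc - night_acc}
    return gaps


# Toy records, invented for illustration only (not results from the paper).
toy_records = [
    {"qa_type": "paired_type_a", "condition": "day", "correct": True},
    {"qa_type": "paired_type_a", "condition": "day", "correct": True},
    {"qa_type": "paired_type_a", "condition": "night", "correct": True},
    {"qa_type": "paired_type_a", "condition": "night", "correct": False},
    {"qa_type": "unpaired_type_b", "condition": "night", "correct": True},
]

gaps = day_night_gap(toy_records)
for qa_type, stats in gaps.items():
    print(qa_type, stats)
```

Types that are evaluated only at night are skipped here, since a gap is defined only where the same question has both a day and a night clip.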

Paper Structure

This paper contains 36 sections, 3 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Overview of EgoNight. EgoNight integrates diverse video sources spanning synthetic environments and real-world indoor and outdoor scenes, recorded under both daytime and nighttime conditions with spatial and temporal alignment. It consists of three benchmarks: (i) egocentric VQA as the primary focus, (ii) day–night correspondence retrieval, and (iii) egocentric depth estimation, all targeting the challenges of low-light egocentric vision. The day–night alignment (illustrated on the right with VQA examples) enables rigorous analysis of illumination gaps in MLLMs.
  • Figure 2: EgoNight construction and EgoNight-VQA annotation. EgoNight integrates EgoNight-Synthetic, EgoNight-Sofia, and EgoNight-Oxford sources. Annotation is achieved via a novel three-stage day-augmented Auto QA generation pipeline with 300+ hours of human refinement, resulting in over 3600 high-quality QA pairs.
  • Figure 3: QA types with examples. The first eight are paired types, where the same question–answer pair applies to both day and night clips; the last four are unpaired and evaluated only at night. Clip duration varies by QA type: static or spatial tasks (e.g., types 1 and 3) use short clips, while dynamic or temporal tasks (e.g., types 4 and 5) use full videos.
  • Figure 4: Statistics of the EgoNight-VQA benchmark. (a) Distribution of QA pairs across QA types and sources. (b) Video duration distribution. (c) Task difficulty levels across scenarios. (d) Scenario coverage. (e) Illumination coverage.
  • Figure 5: Performance analysis of MLLMs on EgoNight-VQA. (a) Day–night performance gap across paired QA types, showing consistent degradation at night. (b) Nighttime performance across all 12 QA types.
  • ...and 12 more figures