
ELVIS: Enhance Low-Light for Video Instance Segmentation in the Dark

Joanne Lin, Ruirui Lin, Yini Li, David Bull, Nantheera Anantrasirichai

TL;DR

ELVIS tackles video instance segmentation (VIS) in dark scenes with three components: an unsupervised synthetic low-light video pipeline, a calibration-free degradation profiler (VDP-Net) that estimates degradation parameters, and an enhancement decoder that disentangles degradations from content within VIS architectures. The framework enables domain adaptation of state-of-the-art VIS models to low-light scenarios, yielding improvements of up to 3.7 AP on synthetic low-light YouTube-VIS 2019 data and qualitative gains on real LMOT-S data. Key contributions include a physics-based, temporally aware degradation model, unsupervised degradation profiling, and an enhancement head embedded in Mask2Former-based VIS models. Together these provide synthetic data generation, explicit degradation disentanglement, and improvements across multiple backbones and datasets, with implications for autonomous driving, surveillance, and robotics in low-light environments.

Abstract

Video instance segmentation (VIS) for low-light content remains highly challenging for humans and machines alike, due to adverse imaging conditions including noise, blur and low contrast. The lack of large-scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low-light videos and, as a result, perform poorly even when finetuned on low-light data. In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. ELVIS comprises an unsupervised synthetic low-light video pipeline that models both spatial and temporal degradations, a calibration-free degradation profile synthesis network (VDP-Net), and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low-light YouTube-VIS 2019 dataset. Code will be released upon acceptance.

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Video Instance Segmentation (VIS) in real low-light conditions. Comparison of results using (left) GenVIS [heo2023genvis] and (middle) GenVIS+ELVIS. (Right) Quantitative performance comparison between the pre-trained VIS method, the method re-trained on synthetic low-light data, and the proposed ELVIS framework across different levels of difficulty.
  • Figure 2: Overview of the proposed ELVIS framework, which consists of two main components: (i) the unsupervised synthetic low-light pipeline and (ii) the augmented instance segmentation module. The synthetic low-light video pipeline (green panel) degrades clean videos $X^{high}$ using the low-light degradation model (yellow panel) and degradation parameters $\phi$ estimated by VDP-Net.
  • Figure 3: Visual comparison of the linear motion blur + Gaussian blur, the Multivariate Gaussian blur, and the difference between the two resulting images (intensities normalized to [-1, 1]). The blur kernels can be seen in the top-left of each blurred image.
  • Figure 4: Visual comparison of video instance segmentation results on the LMOT-S dataset using the GenVIS [heo2023genvis] method with a ResNet-50 [he2016resnet] backbone finetuned on our synthetic data (top row) versus implementing our ELVIS framework (middle row), with ground truth (bottom row) for reference. The columns represent frames in the example video, sampled every 10 frames from time $t$, to show tracking performance.
  • Figure 5: Qualitative comparison of several synthetic pipelines against frames from real low-light datasets (SDSD [wang2021sdsd], DID [fu2023did], LMOT [wang2024lmot]). The brightness and contrast in the bottom-left triangles of each patch were adjusted by 40% for better visibility.
  • ...and 4 more figures
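
To illustrate the comparison in Figure 3, the following is a minimal sketch and not the authors' implementation. It builds (a) a horizontal linear motion-blur kernel convolved with an isotropic Gaussian and (b) a single multivariate (anisotropic) Gaussian kernel, then measures how far the two kernels differ. The function names, kernel size, blur length, sigma, and the variance-matching rule used to choose the covariance are all assumptions made for this example; the paper's actual parameterization is not given in this section.

```python
import math


def gaussian_kernel(size, cov):
    """Normalized 2D Gaussian kernel for a 2x2 covariance ((sxx, sxy), (sxy, syy))."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Entries of the inverse covariance matrix.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    c = size // 2
    k = [[math.exp(-0.5 * (ixx * x * x + 2 * ixy * x * y + iyy * y * y))
          for x in range(-c, c + 1)]
         for y in range(-c, c + 1)]
    return normalize(k)


def motion_kernel(size, half_length):
    """Horizontal line of 2*half_length+1 equal weights through the kernel centre."""
    c = size // 2
    k = [[0.0] * size for _ in range(size)]
    for x in range(c - half_length, c + half_length + 1):
        k[c][x] = 1.0 / (2 * half_length + 1)
    return k


def normalize(k):
    """Scale a kernel so that its entries sum to 1."""
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def convolve_same(a, b):
    """Zero-padded 'same' 2D convolution of two odd-sized square kernels."""
    n, c = len(a), len(a) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for u in range(n):
                for v in range(n):
                    ii, jj = i - (u - c), j - (v - c)
                    if 0 <= ii < n and 0 <= jj < n:
                        s += a[ii][jj] * b[u][v]
            out[i][j] = s
    return out


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two kernels."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def build_kernels(size=15, sigma=1.0, half_length=2):
    """Return (motion blur convolved with Gaussian, anisotropic Gaussian)."""
    iso = gaussian_kernel(size, ((sigma ** 2, 0.0), (0.0, sigma ** 2)))
    composite = normalize(convolve_same(motion_kernel(size, half_length), iso))
    # Assumption: match the x-variance of the composite kernel. A uniform
    # line over {-h, ..., h} has variance h * (h + 1) / 3.
    line_var = half_length * (half_length + 1) / 3.0
    aniso = gaussian_kernel(size, ((sigma ** 2 + line_var, 0.0), (0.0, sigma ** 2)))
    return composite, aniso


if __name__ == "__main__":
    composite, aniso = build_kernels()
    print("max |difference| between kernels:", max_abs_difference(composite, aniso))
```

Matching the variance along the blur direction is only one way to relate the two kernels. Adding off-diagonal covariance terms would rotate the anisotropic Gaussian, in the same way that a rotated line would change the direction of the motion blur.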