CLIP-Optimized Multimodal Image Enhancement via ISP-CNN Fusion for Coal Mine IoVT under Uneven Illumination

Shuai Wang, Shihao Zhang, Jiaqi Wu, Zijian Tian, Wei Chen, Tongzhu Jin, Miaomiao Xue, Zehua Wang, Fei Richard Yu, Victor C. M. Leung

TL;DR

The paper addresses unsafe and unreliable imaging in underground coal mine IoVT systems caused by low and uneven illumination. It introduces a CLIP-guided multimodal optimization framework that trains without paired references and an ISP-CNN fusion architecture that performs two-stage image enhancement (global luminance then local detail refinement) suitable for edge devices. Key contributions include the CLIP-based linguistic-image pairing and cue refinement losses, and a lightweight ISP-CNN enhancement module that reduces artifacts while balancing performance and computation. Experiments on coal mine and public datasets demonstrate improved PSNR, SSIM, and VIF, plus favorable edge deployment metrics, indicating practical gains for real-time, safer monitoring in harsh mining environments.

Abstract

Clear monitoring images are crucial for the safe operation of coal mine Internet of Video Things (IoVT) systems. However, low illumination and uneven brightness in underground environments significantly degrade image quality, and existing enhancement methods often rely on paired reference images that are difficult to obtain. In addition, edge devices within IoVT systems face a trade-off between enhancement performance and computational efficiency. To address these issues, we propose a multimodal image enhancement method tailored for coal mine IoVT, built on an ISP-CNN fusion architecture optimized for uneven illumination. Its two-stage strategy combines global enhancement with detail optimization, effectively improving image quality, especially in poorly lit areas. A CLIP-based multimodal iterative optimization allows the enhancement algorithm to be trained without supervision. By integrating traditional image signal processing (ISP) with convolutional neural networks (CNN), our approach reduces computational complexity while maintaining high performance, making it suitable for real-time deployment on edge devices. Experimental results demonstrate that our method effectively mitigates uneven brightness and improves key image quality metrics, raising PSNR by 2.9%-4.9%, SSIM by 4.3%-11.4%, and VIF by 4.9%-17.8% compared to seven state-of-the-art algorithms. Simulated coal mine monitoring scenarios validate our method's ability to balance performance and computational demands, facilitating real-time enhancement and supporting safer mining operations.
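
The two-stage strategy described above (global enhancement followed by detail optimization) can be pictured with a minimal NumPy sketch. This is not the authors' ISP-CNN module: the gamma curve used for the global stage, the box-blur residual used for the detail stage, and the names `global_enhance`, `detail_refine`, `enhance`, `gamma`, and `strength` are all assumptions chosen for illustration. In the paper these quantities are produced by learned convolutional layers.

```python
import numpy as np

def global_enhance(img, gamma):
    """Stage 1 (assumed ISP operation): one gamma curve applied to the whole image."""
    return np.clip(img, 0.0, 1.0) ** gamma

def box_blur(img, k=3):
    """k x k mean filter with edge padding; a stand-in for a learned convolution."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def detail_refine(img, strength=0.5):
    """Stage 2 (assumed): add back a high-frequency residual map to sharpen detail."""
    residual = img - box_blur(img)
    return np.clip(img + strength * residual, 0.0, 1.0)

def enhance(img, gamma=0.5, strength=0.5):
    """Global brightening first, then local detail correction."""
    return detail_refine(global_enhance(img, gamma), strength)

# Hypothetical dark grayscale frame with values in [0, 0.2].
rng = np.random.default_rng(0)
dark = 0.2 * rng.random((8, 8))
out = enhance(dark)
```

With `gamma` below 1 the first stage lifts dark pixels more than bright ones, which is why a single global curve helps with low illumination but cannot by itself correct uneven local brightness. The second stage is where a per-pixel correction map would come in.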

Paper Structure

This paper contains 14 sections, 12 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Architecture of the coal mine distributed IoVT system, which uses a brightness disparity enhancement method. After the original video is captured by the monitoring equipment, the system performs light intensity enhancement within the distributed IoVT framework. The enhanced video is then transmitted to the cloud server via data transmission platforms and switches. On the cloud server, the images are further processed for advanced tasks. The red arrow indicates the data download flow, while the green arrow indicates the data upload flow.
  • Figure 2: Structure of our algorithm. The linguistic-image pairing stage aims to optimize $T_{T}$, enabling it to differentiate between low-light and normal-light images. In the image enhancement stage, the optimized cue $T_{T}$ guides the luminance enhancement unit to perform image enhancement. Furthermore, all enhanced images generated in the previous stage are utilized to refine $T_{T}$, enhancing its ability to perceive luminance characteristics in local regions. Purple arrows indicate the computational flow during model training, dashed lines represent the optimization objectives for each loss function, and blue arrows depict the data flow during the inference stage.
  • Figure 3: Structure of our ISP-CNN fusion architecture. In the image enhancement module, hyperparameters are generated through convolutional mapping, while in the image detail processing module, a mapping map is produced to further refine and correct the output of the image enhancement module.
  • Figure 4: Algorithm performance when trained with different label images. From left to right, the images show the input, the inference results of models trained with labels of varying exposures, and the inference results of models trained with labels of differing semantic information but consistent brightness. A significant difference in enhancement effect is observed between models trained with different label images.
  • Figure 5: Enhancement results of our algorithm with and without the cue word refinement stage: The images, in sequence, are the input image, the algorithm-enhanced image, the algorithm-enhanced image without the cue refinement stage, and the reference image. The version of the algorithm without the cue refinement stage exhibits limited brightness enhancement in localized regions.
  • ...and 6 more figures
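
The linguistic-image pairing stage in the Figure 2 caption, where a cue is optimized to tell low-light from normal-light images, can be pictured as a two-way classification over image-cue similarities. The sketch below is an assumption-laden illustration, not the paper's loss: the feature vectors are synthetic stand-ins for CLIP image and text embeddings (no CLIP model is called), and `pairing_probs`, `pairing_loss`, and the temperature value `0.07` are hypothetical choices. The exact loss formulation is not given in this summary.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def pairing_probs(image_feats, cue_feats, temperature=0.07):
    """Softmax over cosine similarities.

    Column 0 is the low-light cue, column 1 the normal-light cue (assumed layout).
    """
    logits = l2_normalize(image_feats) @ l2_normalize(cue_feats).T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def pairing_loss(image_feats, cue_feats, labels):
    """Cross-entropy between cue probabilities and low/normal-light labels."""
    probs = pairing_probs(image_feats, cue_feats)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

# Synthetic 4-D embeddings: two low-light images, two normal-light images.
rng = np.random.default_rng(0)
low_dir, normal_dir = np.eye(4)[0], np.eye(4)[1]
image_feats = np.stack([low_dir, low_dir, normal_dir, normal_dir])
image_feats = image_feats + 0.05 * rng.standard_normal((4, 4))
labels = np.array([0, 0, 1, 1])  # 0 = low-light, 1 = normal-light

good_cues = np.stack([low_dir, normal_dir])  # cues aligned with the image clusters
swapped_cues = good_cues[::-1]               # deliberately mismatched cues

loss_good = pairing_loss(image_feats, good_cues, labels)
loss_swapped = pairing_loss(image_feats, swapped_cues, labels)
```

In this toy setup the aligned cues give a much lower loss than the swapped ones, which is the signal a gradient-based optimizer would follow when updating the cue embeddings. The later refinement stage described in the caption would add enhanced images to the comparison.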