
JW-VL: A Vision-Language Model for Solar Physics

Mingfu Shao, Hui Wang, Liyue Tong, Yuyang Li, Cunshi Wang, Jiaben Lin, Suo Liu, Haiqing Xu, Yin Zhang, Jing Huang

Abstract

Vision-Language Models (VLMs) have achieved breakthrough progress in general knowledge domains, yet adaptation to specialized scientific fields remains challenging due to multimodal representation shifts and the limited integration of domain-specific knowledge. To address the limitations of general-purpose VLMs when applied to solar physics image recognition, analysis, and reasoning, we propose JinWu Vision-Language (JW-VL), a fine-tuned foundation model tailored for solar physics. The model integrates multi-wavelength observational data from both space-based and ground-based telescopes, encompassing representative spectral bands spanning the photosphere, chromosphere, and corona. Built upon a cross-modal alignment knowledge distillation framework, JW-VL learns a joint visual-semantic embedding that enables end-to-end modeling from raw solar observational data to downstream tasks, including solar image recognition, solar activity analysis via image-based question answering, and optical character recognition (OCR), while also supporting the construction of a multi-band, cross-instrument solar image benchmark dataset. Furthermore, as a demonstration of interdisciplinary applicability, we developed a "Daily Solar Activity Reports" agent comprising core modules for solar activity level assessment, significant active region characterization, magnetic field complexity analysis, potential space weather impact assessment, and identifying active regions for targeted observation. While JW-VL may not yet meet the rigorous, high-precision demands of operational solar physics, it bridges raw observations and diverse downstream tasks, establishing a valuable methodological framework for applying multimodal deep learning to the field.

Paper Structure

This paper contains 13 sections, 7 figures, and 1 table.

Figures (7)

  • Figure 1: The limitations of the QVQ-Max and Gemini 2.5 Pro models in interpreting a specialized solar magnetogram. Both responses misinterpret the image as an intensity-based photospheric observation rather than a line-of-sight magnetogram. As a result, magnetic field signals are incorrectly described in terms of brightness, temperature, sunspots, and plage, leading to physical mischaracterization of the observed structures.
  • Figure 2: Overview of the pipeline for constructing the CoT-SFT vision-language fine-tuning dataset, including multi-source solar data integration, expert-designed prompt templates, and iterative knowledge distillation.
  • Figure 3: The architecture of the JW-VL model, illustrating the integration of a vision encoder, cross-modal alignment, and a Transformer-based language decoder fine-tuned for solar physics tasks. The JinWu logo, named for the mythical three-legged bird (JinWu) from ancient Chinese folklore that symbolizes the sun, synthesizes three distinct cultural and scientific elements: the ancient Chinese seal script character ""; the golden Sun Bird motif, a culturally significant artifact unearthed at the Jinsha Site (circa 1200-650 BCE) representing solar worship; and SDO's Atmospheric Imaging Assembly (AIA) 171 Å band imagery.
  • Figure 4: Representative interaction examples illustrating the qualitative performance of the JW-VL model across heterogeneous solar observational data and task types. The examples include recognition and physical interpretation of solar magnetograms and chromospheric images, image-based scientific question answering, identification of solar activity features, and metadata extraction via OCR.
  • Figure 5: The workflow of the daily solar activity reports agent used as an application example of the JW-VL model. The agent performs automated data acquisition, multimodal interpretation of solar observations, and daily report generation.
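The daily-report agent described in the abstract and in the Figure 5 caption follows a three-stage loop: automated data acquisition, multimodal interpretation of the observations by the VLM, and report generation. A minimal Python sketch of that loop is shown below; all function names, report fields, and returned values are hypothetical illustrations of the pipeline's shape, not the paper's actual API.

```python
# Sketch of the daily solar activity report agent loop (Figure 5).
# Every identifier here is a hypothetical placeholder, not JW-VL code.

from dataclasses import dataclass, field


@dataclass
class SolarReport:
    """One day's report: the modules named in the abstract map to fields."""
    activity_level: str
    active_regions: list = field(default_factory=list)
    space_weather_notes: str = ""


def acquire_observations():
    # Placeholder: in practice this would fetch the latest multi-wavelength
    # images (e.g. photospheric magnetograms, chromospheric and coronal bands).
    return {"aia_171": "<image bytes>", "magnetogram": "<image bytes>"}


def interpret(observations):
    # Placeholder for JW-VL inference: each image plus a task prompt
    # yields a textual assessment that is parsed into structured fields.
    return {
        "activity_level": "low",
        "active_regions": ["AR-hypothetical-1"],
        "space_weather": "no significant impact expected",
    }


def generate_daily_report():
    obs = acquire_observations()
    analysis = interpret(obs)
    return SolarReport(
        activity_level=analysis["activity_level"],
        active_regions=analysis["active_regions"],
        space_weather_notes=analysis["space_weather"],
    )


report = generate_daily_report()
print(report.activity_level)
```

Keeping acquisition, interpretation, and report assembly as separate stages mirrors the modular design the paper describes, so any one stage (for example, the VLM backend) can be swapped without touching the others.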