PReD: An LLM-based Foundation Multimodal Model for Electromagnetic Perception, Recognition, and Decision

Zehua Han, Jing Xiao, Yiqi Duan, Mengyu Xiang, Yuheng Ji, Xiaolong Zheng, Chenghanyu Zhang, Zhendong She, Junyu Shen, Dingwei Tan, Shichu Sun, Zhou Cong, Mingxuan Liu, Fengxiang Wang, Jinping Sun, Yangang Sun

Abstract

Multimodal Large Language Models have demonstrated powerful cross-modal understanding and reasoning capabilities in general domains. In the electromagnetic (EM) domain, however, they still face challenges such as data scarcity and insufficient integration of domain knowledge. This paper proposes PReD, the first foundation model for the EM domain that covers the intelligent closed loop of perception, recognition, and decision-making. We construct a high-quality multitask EM dataset, PReD-1.3M, and an evaluation benchmark, PReD-Bench. The dataset encompasses multi-perspective representations such as raw time-domain waveforms, frequency-domain spectrograms, and constellation diagrams, covering typical features of communication and radar signals. It supports a range of core tasks, including signal detection, modulation recognition, parameter estimation, protocol recognition, radio-frequency fingerprint recognition, and anti-jamming decision-making. PReD adopts a multi-stage training strategy that unifies multiple EM-signal tasks, achieving closed-loop optimization from end-to-end signal understanding to language-driven reasoning and decision-making, and significantly enhancing EM domain expertise while maintaining general multimodal capabilities. Experimental results show that PReD achieves state-of-the-art performance on PReD-Bench, which is constructed from both open-source and self-collected signal datasets. These results collectively validate the feasibility and potential of vision-aligned foundation models in advancing the understanding and reasoning of EM signals.
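
The abstract names three of the multi-perspective renderings derived from raw signals. The exact plotting parameters used for PReD-1.3M are not given in this excerpt, so the sketch below is purely illustrative: the sample rate, pulse shaping, noise level, and FFT settings are all assumptions, using a synthetic QPSK-like burst as a stand-in for a captured signal.

```python
# Minimal sketch: rendering two of the multi-view representations
# (frequency-domain spectrogram, constellation diagram) from raw IQ samples.
# All signal and plotting parameters here are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

rng = np.random.default_rng(0)
fs = 1e6                                   # assumed sample rate (Hz)
sps = 16                                   # samples per symbol (illustrative)
symbols = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=256)  # QPSK stand-in
iq = np.repeat(symbols, sps)               # rectangular pulse shaping
iq = iq + 0.05 * (rng.standard_normal(iq.size) + 1j * rng.standard_normal(iq.size))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Frequency-domain view: two-sided magnitude spectrogram of the complex baseband.
f, t, Sxx = spectrogram(iq, fs=fs, nperseg=256, noverlap=192, return_onesided=False)
ax1.pcolormesh(t, np.fft.fftshift(f),
               np.fft.fftshift(10 * np.log10(Sxx + 1e-12), axes=0),
               shading="auto")
ax1.set(title="Spectrogram", xlabel="Time (s)", ylabel="Frequency (Hz)")

# Constellation view: I/Q scatter sampled at symbol centers.
ax2.scatter(iq.real[sps // 2::sps], iq.imag[sps // 2::sps], s=4)
ax2.set(title="Constellation", xlabel="In-phase", ylabel="Quadrature")

plt.tight_layout()
plt.savefig("pred_views.png", dpi=150)
```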

Paper Structure

This paper contains 38 sections, 2 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Overview of PReD and its capabilities, instantiated across six core electromagnetic (EM) tasks that span the three tiers of perception, recognition, and decision-making. The figure demonstrates how PReD processes multi-view signal visualizations to address a diverse range of EM instructions, from quantitative parameter estimation to open-ended anti-jamming strategy generation.
  • Figure 2: Overall pipeline for the construction of the PReD-1.3M training set and PReD-Bench. We first generate five types of signal visualizations from the raw IQ signals. We then create corresponding OpenQA and MCQA pairs for each task type to form a complete instruction set. Finally, a portion of this set is sampled to obtain PReD-Bench, which is kept strictly separate from the training data (a hypothetical instruction-record sketch follows this list).
  • Figure 3: Scale and composition of the EM training dataset across six tasks and two QA formats.
  • Figure 4: Overall pipeline of PReD. (Bottom to top) Multi-view EM renderings are encoded and projected into the token space, where they are combined with tokenized instructions. A pre-trained LLM decoder is first aligned with visual features and then fine-tuned through a multi-stage curriculum to become the specialized PReD LLM Decoder. The model supports two output modes: open-ended generation (OpenQA) and structured multiple-choice question answering (MCQA); a minimal fusion sketch follows this list.
  • Figure 5: Overview of our four-stage curriculum and data composition. The outer ring shows the training stages; the middle ring indicates data regimes (LCS, general single/multi-image); the inner ring lists EM tasks (SSD, SSE, MR, PR, EI, AJSD) emphasized in Stage 4. Percentages denote the proportion of samples used per stage (an illustrative curriculum config follows this list).
  • ...and 5 more figures
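
Figure 2 pairs each rendered view with OpenQA and MCQA instructions. The record layout below is purely hypothetical: every field name, file name, and answer string is an assumption for illustration, not the released PReD-1.3M schema. It shows the kind of instruction pair the pipeline would produce for, say, modulation recognition and anti-jamming decision-making.

```python
# Hypothetical layout for one MCQA and one OpenQA instruction record, in the
# spirit of Figure 2. All field names, file names, and answers below are
# illustrative assumptions -- the actual PReD-1.3M schema is not shown here.
import json

mcqa_record = {
    "images": ["sig_000123_spectrogram.png", "sig_000123_constellation.png"],
    "task": "modulation_recognition",        # one of the six core EM tasks
    "format": "MCQA",
    "question": "Which modulation scheme best matches the signal shown?",
    "choices": {"A": "BPSK", "B": "QPSK", "C": "16QAM", "D": "FM"},
    "answer": "B",
}

openqa_record = {
    "images": ["sig_000456_waveform.png"],
    "task": "anti_jamming_decision",
    "format": "OpenQA",
    "question": "Propose an anti-jamming strategy for the observed interference.",
    "answer": "Hop to an unoccupied sub-band and increase coding redundancy.",
}

print(json.dumps(mcqa_record, indent=2))
```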
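Figure 4's fusion step, in which encoded renderings are projected into the token space and combined with tokenized instructions, can be sketched as below. The two-layer MLP projector, all dimensions, and the prepend-style concatenation are assumptions; the excerpt does not specify PReD's actual encoder or projector design.

```python
# Minimal PyTorch sketch of the fusion step in Figure 4: projected visual
# tokens are concatenated with embedded instruction tokens before the LLM
# decoder. Projector architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder features into the LLM token-embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vis_feats)          # (B, n_tokens, llm_dim)

B, n_views, n_patches = 2, 3, 256            # e.g. waveform / spectrogram / constellation
vis_feats = torch.randn(B, n_views * n_patches, 1024)   # stand-in encoder output
text_embeds = torch.randn(B, 64, 4096)                  # stand-in instruction embeddings

fused = torch.cat([VisualProjector()(vis_feats), text_embeds], dim=1)
print(fused.shape)                           # torch.Size([2, 832, 4096])
```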
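Finally, Figure 5's four-stage curriculum can be captured as simple bookkeeping. The stage-to-regime alignment below is only inferred from the caption's ring description and may not match the paper, and the mixing proportions are deliberately left as `None` because the excerpt does not state the per-stage percentages.

```python
# Illustrative bookkeeping for the four-stage curriculum of Figure 5.
# The stage -> data-regime alignment is inferred from the caption and may not
# match the paper; mix=None marks the unstated per-stage percentages.
EM_TASKS = ["SSD", "SSE", "MR", "PR", "EI", "AJSD"]   # emphasized in Stage 4

CURRICULUM = [
    {"stage": 1, "regime": "LCS alignment",        "mix": None},
    {"stage": 2, "regime": "general single-image", "mix": None},
    {"stage": 3, "regime": "general multi-image",  "mix": None},
    {"stage": 4, "regime": "EM instruction data", "tasks": EM_TASKS, "mix": None},
]

for stage in CURRICULUM:
    print(stage)
```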