WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications

Xin Li, Mengbing Liu, Li Wei, Jiancheng An, Mérouane Debbah, Chau Yuen

TL;DR

WirelessMathBench evaluates LLMs on domain-specific mathematical modeling in wireless communications. It compiles $587$ questions from $40$ peer-reviewed papers and organizes the tasks from multiple-choice questions through progressively masked derivations to full equation completions, testing both symbolic reasoning and physical feasibility under dimensional constraints. The study finds that reasoning-enabled models offer measurable gains, but fully reconstructing complex wireless equations remains very difficult: the best performer, DeepSeek-R1, achieves only $38.05\%$ average accuracy and $7.83\%$ on full equation completion. The public release of the dataset and evaluation toolkit aims to spur advances in domain-adaptive pre-training and robust, domain-aware AI for engineering applications in next-generation wireless networks.

Abstract

Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.

Paper Structure

This paper contains 51 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Example task from WirelessMathBench: a system model derivation from the wireless communications literature. The task progresses from a multiple-choice question to progressive mask completion questions, and finally to the full formula derivation, testing the model's ability to reason through complex channel reflections and matrix operations.
  • Figure 2: Overview of the data collection and annotation pipeline for WirelessMathBench. The process involves selecting high-quality research papers, extracting system models from papers, curating tasks of varying complexity levels, and reviewing each task for clarity and correctness.
  • Figure 3: A word cloud illustrating the most frequent keywords in the WirelessMathBench benchmark, which reflects the range of wireless communication topics covered.
  • Figure 4: Error distribution among 40 annotated DeepSeek-R1 errors.
  • Figure 5: An example question and the corresponding output from LLMs for a multiple-choice task.
  • ...and 9 more figures