Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study
Yixuan Li, Yuhao Lu, Yang Liu, Liang Li, R. Ruffini, Di Li, Rong-Gen Cai, Xiaoyan Zhu, Wenbin Lin, Yu Wang
TL;DR
To identify gravitational wave signals in non-Gaussian, non-stationary detector noise with limited labeled data, the authors evaluate large language models trained directly on observational data. They convert time-series data into time-frequency patch tokens and fine-tune an 8B-parameter LLM (Meta-Llama-3-8B-Instruct) with LoRA, achieving 97.4% recall on held-out GW segments without simulated injections. Adding large simulated datasets provides negligible gains, while increasing model size yields predictable improvements that converge around 8B parameters; increasing dataset size also improves performance, with diminishing returns at large scales. The results suggest that LLMs can efficiently extract global, coherent patterns from complex astronomical data and may generalize to other domains with similar noise characteristics.
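To make the "time-frequency patch tokens" step concrete, here is a minimal sketch of one way such tokens could be built: a short-time Fourier transform followed by non-overlapping patches. This is an illustration under stated assumptions, not the authors' pipeline. The function names `spectrogram` and `patchify`, the window length, hop, patch size, sample rate, and the toy chirp are all hypothetical choices.

```python
import numpy as np


def spectrogram(x, n_fft=256, hop=64):
    """Magnitude STFT with a Hann window; returns shape (n_freq, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def patchify(spec, patch_f=16, patch_t=16):
    """Cut a (n_freq, n_frames) map into flattened, non-overlapping patches."""
    # Drop the ragged edges so both axes divide evenly into patches.
    n_f = (spec.shape[0] // patch_f) * patch_f
    n_t = (spec.shape[1] // patch_t) * patch_t
    spec = spec[:n_f, :n_t]
    patches = spec.reshape(n_f // patch_f, patch_f, n_t // patch_t, patch_t)
    patches = patches.transpose(0, 2, 1, 3)  # (freq block, time block, f, t)
    return patches.reshape(-1, patch_f * patch_t)


# Toy input (assumption): 1 s of a rising-frequency chirp in white noise.
rng = np.random.default_rng(0)
fs = 4096  # hypothetical sample rate in Hz
t = np.arange(fs) / fs
strain = np.sin(2 * np.pi * (50 * t + 100 * t**2)) + rng.normal(size=t.size)

spec = spectrogram(strain)   # (129 frequency bins, 61 frames)
tokens = patchify(spec)      # (24 patches, 256 values per patch)
print(spec.shape, tokens.shape)
```

Each row of `tokens` would then be mapped to the model's embedding dimension (for example by a learned linear projection) before being fed to the LLM; how the paper performs that mapping is not stated in this summary.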
Abstract
This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing in regimes with non-Gaussian, non-stationary noise and limited labeled samples. Gravitational wave observations provide a suitable test case: using only 90 LIGO events, fine-tuned LLMs achieve 97.4% accuracy for identifying signals. Further experiments show that, in contrast to traditional networks that rely on large simulated datasets, additional simulated samples do not improve LLM performance, while scaling studies reveal predictable gains with increasing model size and dataset size. These results indicate that LLMs can extract discriminative structure directly from observational data and provide an efficient assessment for gravitational wave identification. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
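The fine-tuning referred to above uses LoRA according to the TL;DR. As a rough illustration of why that is practical with few labeled events, the sketch below implements the standard LoRA update, in which a frozen weight W is augmented by a low-rank term (alpha / r) * B A and only A and B are trained. The class name `LoRALinear`, the layer size, rank, and alpha are hypothetical values chosen for the example; the paper's actual settings are not given here.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA)."""

    def __init__(self, weight, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                              # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        # Share of parameters that receive gradients under LoRA.
        return (self.A.size + self.B.size) / self.weight.size


rng = np.random.default_rng(1)
base = rng.normal(size=(512, 512))   # hypothetical 512 x 512 projection
layer = LoRALinear(base, rank=8, alpha=16)
x = rng.normal(size=(4, 512))

# B starts at zero, so the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ base.T))
print(layer.trainable_fraction())
```

Because B is initialized to zero, training starts from the pretrained model's behavior, and the number of trainable parameters grows with the rank rather than with the full weight matrix.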
